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Patent Abstract Files 



File 347:JAPIO Dec 1976-2006/ 
File 350:Derwent WPDC 1963-2007 



Set Items Description 

51 441 (KERNEL? OR PRIVILEGE?) (2W) MODE? ? 

52 8401812 (ARREST? OR BAR OR BARS OR BARRICAD? OR BARRIER? OR B- 

LOCK??? OR CLOSE? OR HALT OR HALTS OR HALTED OR HINDER? OR 
HOLD? OR IMPEDE? OR IMPEDIMENT? OR OBSTACLE? OR OBSTRUCTIO- 
N? OR PREVENT? OR ROADBLOCK? OR STOP???? OR WALL?) 

53 516809 STREAM? OR INSTRUCTION? 

54 362 1 60 (EVENT? ? OR CONTEXT (2N) (SHIFT? OR SWITCH? )OR THREAD? ? 

OR SYSTEM (2W) CALL???) 

55 54591 S2(15N)S3 

56 41 S1(15N)S4 

57 79 S1(15N)S3 

58 9 S5 AND (S6 OR S7) 

59 5 S8 AND PY=1 963:2004 

510 7 S8 AND AY= 1963: 2004 

511 7 S9ORS10 

51 2 7 IDP AT (sorted in duplicate/non-duplicate order) 

5 1 3 7 IDP AT (primary/non-duplicate records only) 



1 3/3,K/4 (Item 4 from file: 350) 
DIALOG(R)File 350:Derwent WPDC 
(c) 2007 The Thomson Corporation. All rts. reserv. 



0013401596 - Drawing available 
WPI ACC NO: 2003-491774/200346 
XRPX Acc No: N2003-390629 
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Kernel streaming testing method for use in software filters in multimedia 
environment, has multiple software filters in virtual circuits receiving 
multimedia data stream from source module and analysis module 
Patent Assignee: MICROSOFT CORP (MICT) 
Inventor: CHEN Y; HAGIU C 
Patent Family (1 patents, 1 countries) 
Patent Application 

Number Kind Date Number Kind Date Update 

US 6526523 Bl 20030225 US 1998179647 A 19981027 200346 B 

Priority Applications (no, kind, date): US 1998179647 A 19981027 
Patent Details 

Number Kind Lan Pg Dwg Filing Notes 
US 6526523 Bl EN 11 4 

Alerting Abstract ...DESCRIPTION OF DRAWINGS - The diagram shows a 
block diagram of a computer system for testing software filters 
used in kernel data streaming . 

Original Publication Data by Authority 



Original Abstracts: 

...An analysis module receives the processed data stream from the chain. of 
software, filters and produces performance information as a function of the 
received data stream . The source module, the analysis module 
and each software filter operate in a kernel - mode and are 
configured by a test tool operating in non-kernel mode. Via the 
test tool, a user can select appropriate source and analysis modules as 
well as construct... 



13/3,K/5 (Item 5 from file: 350) 
DIALOG(R)File 350:Derwent WPIX 
(c) 2007 The Thomson Corporation. All its. reserv. 

0006369125 - Drawing available 
WPI ACCNO: 1993-167876 /199320 
XRPX Acc No: N1993-128473 

Fault-tolerant computer system - has all interrupts and traps occurring on 

primary computer during an epoch buffered by virtual machine monitor 

Patent Assignee: DIGITAL EQUIP CORP (DIGI) 

Inventor: BALKOVICH E; LAMPSON B; SCHNEIDER F B; THIEL D 

Patent Family (2 patents, 1 8 countries) 

Patent Application 
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. Number Kind Date Number Kind Date Update 
w WO 1993009494 Al 19930513 WO 1992US9238 A 19921027 199320 B 
US 5488716 A 19960130 US 1991783519 A 19911028 199611 E 
US 1994182311 A 19940114 

Priority Applications (no., kind, date): US 199418231 1 A 199401 14; US 
1991783519 A 19911028 

Patent Details 

Number Kind Lan Pg Dwg Filing Notes 
WO 1993009494 Al EN 39 7 
National Designated States,Original: JP KR 

Regional Designated States,Original: AT BE CH DE DK ES FR GB GR IE IT LU 
MCNLSE 

US 5488716 A EN 14 7 Continuation of application US 
1991783519 

Original Publication Data by Authority 



Original Abstracts: 

...backup computers. Primary and backup virtual machines running on the 
computers are controlled by corresponding virtual machine monitors. The 
virtual machines execute only user-mode instructions , while all 
kernel - mode instructions are trapped and 

handled by the virtual machine monitors. Each computer has a recovery 
register that generates a hardware interrupt each time that a specified 
number of instructions, called... 

...backup computers. Primary and backup virtual machines running on the 
computers are controlled by corresponding virtual machine monitors. The 
virtual machines execute only user-mode instructions , while all 
kernel - mode instructions are trapped and handled by 
the virtual machine monitors . Each computer has a 
recovery register that generates a hardware interrupt each time that a 
specified number of instructions, called an epoch, are executed. Prior... 
Claims: 

...as said interrupts sent by said primary computer, and for delivering 
said buffered interrupts and traps to said backup computer at predefined 
points in said stream of instructions executed said 
backup computer; backup epoch control means for stopping execution 
of said stream of instructions by said backup 
computer at each of said predefined points until said backup computer 
receives the control message indicating that said primary computer has 
completed execution of said stream of instructions through a next one of 
said predefined points in said stream of instructions after 
the predefined point at which said backup computer is stopped , 
such that said backup computer's execution of said stream of 
instructions always lags behind the primary computer's 



t 
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execution of said stream of instructions.Basic Derwent Week: 199320 



13/3,K/6 (Item 6 from file: 350) 
DIALOG(R)File 350:Derwent WPDC 
(c) 2007 The Thomson Corporation. All its. reserv. 

0006125044 - Drawing available 
WPIACCNO: 1992-366431/ 199244 
XRPX AccNo: N1992-279247 

Real-time 1-0 operation in vector processing computer - using rotating 
priority interrupt and dedicated real-time interrupt for each processor 
with access to privileged communication-control modes 
Patent Assignee: CRAY RES INC (CRAY) 

Inventor: FLUNKER G; LEEDOM G W; PRIBNOW R D; SCHIFFLEGER A J 
Patent Family (5 patents, 17 countries) 
Patent Application 

Number Kind Date Number Kind Date Update 

WO 1992017840 Al 19921015 WO 1991US8709 A 19911121 199244 B 

EP 577614 Al 19940112 WO 1991US8709 A 19911121 199402 E 

EP 1992904950 A 19911121 
US 5390300 A 19950214 US 1991677647 A 19910328 199512 E 

US 1994200921 A 19940222 
EP 577614 Bl 19951004 WO 1991US8709 A 19911121 199544 E 

EP 1992904950 A 19911121 
DE69113639 E 19951109 DE 69113639 A 19911121 199550 E 

WO1991US8709 A 19911121 

EP 1992904950 A 19911121 

Priority Applications (no., kind, date): US 1994200921 A 19940222; US 
1991677647 A 19910328 



Patent Details 

Number Kind Lan Pg Dwg Filing Notes 

WO 1992017840 Al EN 34 5 

National Designated States,Original: CA JP KR 

Regional Designated States.Original: AT BE CH DE DK ES FR GB GR IT LU NL 
SE 

EP 577614 Al EN 2 1 PCT Application WO 1991US8709 

Based on OPI patent WO 1992017840 
Regional Designated States,Original: AT BE CH DE DK ES FR GB GR IT LI LU 
NL SE 

US 5390300 A EN 14 5 Continuation of application US 
1991677647 

EP 577614 Bl EN 17 5 PCT Application WO 1991 US8709 

Based on OPI patent WO 1992017840 
Regional Designated States,Original: AT BE CH DE DK ES FR GB GR IT LI LU 
NL SE 

DE 69113639 E DE PCT Application WO 1991 US8709 
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Application EP 1992904950 
Based on OPI patent EP 577614 
Based on OPI patent WO 1992017840 

Equivalent Alerting Abstract ...designated processing devices is brought, 
in response to the real-time signal, into a state where the designated 
processing device is allowed access to a privileged operational 
mode, and therefore access to instructions corresponding to the 
privileged operational mode and to a system privileged 
communication channel connecting the processing device to the I/O control 
device... 

...signalling device, and signalling the designated processing device from 
the I/O control device that input data is available using an input 
signalling device. The instruction execution and signalling steps are 
executed in the privileged operational mode . 

Original Publication Data by Authority 



Claims: 

...to the real-time command, into a state wherein it is allowed to have 

access to a priveleged operational mode under control of a predetermined 

block of data being transferred from the shared memory, in order to 

have access to the instructions corresponding to the privileged 

operational mode and to the privileged communication channel (42); 

and (e) bypassing the operating system in that each 

designated processor has access to the priveleged communication channels 

(42) to read and write data thereon and to signal the I/O processor IOS... 

executing the appropriate real-time instructions; (d) bringing each of the 

designated processing means, in response to the real-time signal, into a 

state wherein the designated processing means is allowed access to a 

privileged operational mode , and therefore access to 

instructions corresponding to the privileged operational 

mode and to a system privileged communication channel 

connecting the processing means to the I/O control means; ( 

e) executing the real-time instructions in the designated 

processing means; (f) signaling an I/O control means from the designated 

processing means when output data is available using an output... 



1 3/3,K/7 (Item 7 from file: 350) 
DIALOG(R)File 350:Derwent WPDC 
(c) 2007 The Thomson Corporation. All rts. reserv. 



0005457523 - Drawing available 
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WPI ACCNO: 1991-057867/ 199108 
XRPX AccNo: N1991-044805 

Central processing unit e.g. for data processing system - has four-level 
pipeline and interface for coprocessor to provide normal instruction 
execution 

Patent Assignee: NIPPON DIGITAL EQUIP KK (DIGI) 
Inventor: LEONARD J S; WILHELM N C 
Patent Family (1 patents, 1 countries) 
Patent Application 

Number Kind Date Number Kind Date Update 

US 4991078 A 19910205 US 1987101983 A 19870929 199108 B 

Priority Applications (no., kind, date): US 1987101983 A 19870929 

Alerting Abstract ...The central processing unit has a four stage 
pipeline, operating with a load/store procedure operation, having a 
simplified instruction on set and providing a user and kernel 
mode of operation. The central processing unit has a data and 
instruction word size of constant (32 bit) width. The addressing 
technique includes a virtual addressing scheme the virtual addressing 
scheme is also the mechanism for separation... 

Original Publication Data by Authority 



Original Abstracts: 

...the central processing system, a simplified instruction set and an 
interface with the coprocessor unit that has a simple and efficient 
interface with the normal instruction execution. The apparatus 
implementing the central processing system is closely connected 
to the instruction set. A discussion of the implementation of 
the data processing system is provided. ... 
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Patent Fulltext Files 



File 348:EUROPEAN PATENTS 1978-2007 
File 349:PCT FULLTEXT 1979-2007 

Set Items Description 

51 936 (KERNEL? OR PRIVILEGE?) (2W) MODE? ? 

52 2028881 (ARREST? OR BAR OR BARS OR BARRICAD? OR BARRIER? OR B- 

LOCK??? OR CLOSE? OR HALT OR HALTS OR HALTED OR HINDER? OR 
HOLD? OR IMPEDE? OR IMPEDIMENT? OR OBSTACLE? OR OBSTRUCTIO- 
N? OR PREVENT? OR ROADBLOCK? OR STOP???? OR WALL?) 

53 464360 STREAM? OR INSTRUCTION? 

54 652271 (EVENT? ? OR CONTEXT (2N) (SHIFT? OR SWITCH? )OR THREAD? ? 

OR SYSTEM (2W) CALL???) 

513 45 S5 AND S6 AND S7 

514 38 S13 AND PY=1 963:2004 

515 43 S13 AND AY=1963:2004 

516 45 S13 0RS14 

51 7 45 IDP AT (sorted in duplicate/non-duplicate order) 
SI 8 44 IDP AT (primary/non-duplicate records only) 



1 8/3,K/7 (Item 7 from file: 348) 
DIALOG(R)File 348 EUROPEAN PATENTS 
(c) 2007 European Patent Office. All its. reserv. 

01605021 

Methods and system for managing computational resources of a coprocessor in 
a computing system 

PATENT ASSIGNEE: 
MICROSOFT CORPORATION, (749868), One Microsoft Way, Redmond, WA 98053, 
(US), (Applicant designated States: all) 
INVENTOR: 

Wilt, Nicholas P., 1920-205th Place Northeast, Sammamish, Washington 
98074, (US) 

Nene, Sameer A., 15816 N.E. 46th Ct.„ Redmond, Washington 98052, (US) 
Bede, Joseph S., Ill, 3819 Densmore Avenue N, Seattle, Washington 98103, 
(US) 

LEGAL REPRESENTATIVE: 
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Grunecker, Kinkeldey, Stockmair & Schwanhausser Anwaltssozietat (100721) 
, Maximilianstrasse 58, 80538 Munchen, (DE) 
PATENT (CC, No, Kind, Date): EP 1326165 A2 030709 (Basic) 
APPLICATION (CC, No, Date): EP 2003000153 030103; 
PRIORITY (CC, No, Date): US 39036 020104 

DESIGNATED STATES: AT; BE; BG; CH; CY; CZ; DE; DK; EE; ES; FI; FR; GB; GR; 
HU; IE; IT; LI; LU; MC; NL; PT; SE; SI; SK; TR 
EXTENDED DESIGNATED STATES: AL; LT; LV; MK; RO 
INTERNATIONAL PATENT CLASS (V7): G06F-009/38 
ABSTRACT WORD COUNT: 92 
NOTE: 

Figure number on first page: 1 A 

LANGUAGE (Publication,Procedural,Application): English; English; English 

...SPECIFICATION primitive is an object that can be used to synchronize 
multiple threads' access to shared resources, such as critical sections, 
mutexes, semaphores or events. 
A thread is an executable entity that comprises a program 
counter, a user-mode stack, a kernel - mode stack and a set of 
register values. 

A token stream is a stream of hardware-independent tokens 
that describe a series of drawing operations. A token stream can be 
translated by a hardware-specific software component, such as... 
inefficient and generally does not scale well with multiple instances of 
a given piece of hardware in a system. As a rule, IN and OUT 
instructions can be executed only in kernel mode , the 
microprocessor mode that allows direct manipulation of hardware. If a 
user mode thread encounters such an instruction, the system 
generates an exception, which, in the case of a user mode thread 
attempting to control hardware, usually results in... application activity 
into DrawPrimitives2 ("DP2") tokens. When the DP2 token stream is 
submitted, a kernel transition occurs and the driver 314 translates the 
DP2 token stream into hardware-specific commands in kernel 
mode . Fig. 3B does not make any assumptions about whether the 
driver 3 14 is in user mode or kernel mode, and in this regard, driver 
component... non-writeable. Memory writes that overrun the buffer would 
then cause an exception. User mode code writing into the command buffer 
could surround the write instructions with a structured exception 
handling block, field the exception and return an error to the 
runtime if the command buffer overflows. The interface to the user mode 
driver would then be.. .commands' DDI counterparts into hardware-specific 
commands and write them into a temporary buffer or buffers (since the 
exact size of the hardware-specific command stream is not known 
until after the translation has been performed). When the recording is 
stopped, or once the token stream has been parsed, the system 
could then allocate a command buffer or command buffers of suitable size 
and copy the translated hardware commands into them... could be used. The 
system of Fig. 7 closely resembles the state of the art in DIRECT3D(R) 
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driver structure, in that a DrawPrimitives2 token stream is passed 

to the kernel mode driver 705. The main difference is that 

the system of Fig. 7 contemplates OS-arbitrated scheduling of the command 

buffers, while DIRECT3D(R) currently does... 

...CLAIMS including preempting by the at least one coprocessor upon the 
occurrence of an external event. 

19. A method according to claim 18, wherein the external event is 
the operating system making a call to a corresponding 
kernel mode driver object to preempt the at least one 
coprocessor. 

20. A method according to claim 1 8, wherein the host processor is 
interrupted to coordinate scheduling... at least one coprocessor upon 
the occurrence of an external event. 

45. At least one computer readable medium according to claim 44, wherein 
the external event is the operating system making a 

call to a corresponding kernel mode driver object 
to preempt the at least one coprocessor. 

46. At least one computer readable medium according to claim 27, wherein 
the host processor is... preempted by the at least one coprocessor 

upon the occurrence of an external event. 

70. A computing device according to claim 69, wherein the external 
event is the operating system making a call to a 
corresponding kernel mode driver object to preempt the at 
least one coprocessor. 

71 . A computing device according to claim 52, wherein the host processor 
is interrupted to coordinate... 



1 8/3,K/8 (Item 8 from file: 348) 
DIALOG(R)File 348:EUROPEAN PATENTS 
(c) 2007 European Patent Office. All rts. reserv. 

01059137 

METHOD AND APPARATUS FOR SELECTING THREAD SWITCH EVENTS IN A 
MULTITHREADED PROCESSOR 

PATENT ASSIGNEE: 
International Business Machines Corporation, (200128), New Orchard Road, 
Armonk, NY 10504, (US), (Proprietor designated states: all) 
INVENTOR: 

BORKENHAGEN, John, Michael, 1359 Westhill Drive S.W., Rochester, MN 55902 
>(US) 

EICKEMEYER, Richard, James, 5277 Howard Street N.W., Rochester, MN 55901, 
(US) 

FLYNN, William, Thomas, 2516 14th Avenue S.W., Rochester, MN 55902, (US) 
LEVENSTEIN, Sheldon, Bernard, 1608 7th Street N.E., Rochester, MN 55906, 
(US) 

WOTTRENG, Andrew, Henry, 4224 Manor View Drive N.W., Rochester, MN 55901, 
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(US) 

LEGAL REPRESENTATIVE: 
Burt, Roger James, Dr. et al (52152), IBM United Kingdom Limited 
Intellectual Property Department Hursley Park, Winchester Hampshire 
S021 2JN, (GB) 

PATENT (CC, No, Kind, Date): EP 1029269 Al 000823 (Basic) 

EP 1029269 Bl 030924 

WO 99021081 990429 
APPLICATION (CC, No, Date): EP 98953515 981014; W0 98US21716 981014 
PRIORITY (CC, No, Date): US 958716 971023 
DESIGNATED STATES: GB; IE 

INTERNATIONAL PATENT CLASS (V7): G06F-009/38; G06F-009/46 
NOTE: 

No A-document published by EPO 

LANGUAGE (Publication,Procedural,Application): English; English; English 
...SPECIFICATION processing system according to the present invention. 

Figure 3 illustrates a block diagram of the storage control unit of 
Figure 2. 

Figure 4 illustrates a block diagram of the thread switch logic, 
the storage control unit and the instruction unit of Figure 2. 

Figure 5 illustrate the changes of state of a thread as the thread 
experiences different thread switch events shown in Figure... which allows 
up to a programmable maximum number of thread switches called the forward 
progress threshold value. After that maximum number of thread switches, 
an instruction must be completed before switching can occur again. 
In this way, thrashing is prevented . Forward progress count 
register 420 may actually be bits 30:31 in the thread switch control 
register 410 or a software programmable forward progress threshold... 

...610, bits 15:17 in thread state register 442 pertaining to thread TO are 
reset to state 111. Execution of this thread is attempted in block 
620 and the state changes to 000. If an instruction successfully 
executes on thread TO, the state of thread TO returns to 1 1 1 and remains 
so. If, however, thread TO cannot execute an instruction, a.. .itself to 
change the priority of itself: 
tsop 1 or 1,1,1 - Switch to dormant thread 
tsop 2 or 1,1,1 - Set active thread to LOW priority 

- Switch to dormant thread 

- NOTE: Only valid in privileged mode unless TSC(19)=1 
tsop 3 or 2,2,2 - Set active thread to MEDIUM priority 
tsop 4 or 3,3,3 - Set active thread to HIGH priority 

- NOTE: Only valid in privileged mode 

Instructions tsop 1 and tsop 2 can be the same instruction 
as embodied herein as or 1,1,1 but they can also be separate 
instructions. These instructions interact with bits 19 and 21 of the... 
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...thread switch control register 320 is zero, the priority for both 
threads is set to medium and the effect of the or x,x,x 
instructions on the priority is blocked . If an external 
interrupt request is active, and if the corresponding thread's priority 
is low, that thread's priority is set to medium. 
The... 



1 8/3.K/9 (Item 9 from file: 348) 
DIALOG(R)File 348:EUROPEAN PATENTS 
(c) 2007 European Patent Office. All rts. reserv. 

00959156 

Method and computer program product for interconnecting software drivers in 
kernel mode 

PATENT ASSIGNEE: 
MICROSOFT CORPORATION, (749861), One Microsoft Way, Redmond, Washington 
98052-6399, (US), (applicant designated states: 
AT;BE;CH;DE;DK;ES;FI;FR;GB;GR;IE;IT;LI;LU;MC;NL;PT;SE) 

INVENTOR: 

Shaw, George H. J., 21213 NE 186 th Street, Woodinville WA 98072, (US) 

O'Rourke, Thomas, Riekonmarjatie 29, 90800 Oulu, (FI) 

Woodruff, Bryan A., 1 020 SW 1 0th Street, New Bend WA 98045, (US) 
LEGAL REPRESENTATIVE: 

Belcher, Simon James et al (5831 1), Urquhart-Dykes & Lord Tower House 
Merrion Way, Leeds LS2 8PA, (GB) 
PATENT (CC, No, Kind, Date): EP 871 1 14 A2 981014 (Basic) 

EP 871114 A3 981216 
APPLICATION (CC, No, Date): EP 97304296 970619; 
PRIORITY (CC, No, Date): US 825856 970404 
DESIGNATED STATES: DE; FR; GB 

INTERNATIONAL PATENT CLASS (V7): G06F-009/46; G06F-013/10; 
ABSTRACT WORD COUNT: 153 

LANGUAGE (Publication,Procedural,Application): English; English; English 

FULLTEXT AVAILABILITY: 

Available Text Language Update Word Count 

CLAIMS A (English) 9842 895 

SPEC A (English) 9842 11279 
Total word count - document A 1 2 1 74 
Total word count - document B 0 
Total word count - documents A + B 1 2 1 74 

...SPECIFICATION of easily using kernel mode drivers, used throughout this 
application, comprises graph building functionality that allows a user to 
select and connect together different processing blocks , called 
filters, to successively manipulate a stream of multimedia data. 
The data typically is a series of samples representing sound or video, 
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and the processing blocks may include decompression processing for 
compressed... 

...so that the graph builder portion of the program may interconnect and 
control their operation and be responsive to user input and rearrangement 
of processing blocks . Because of the consistent stream nature 
of multimedia data and the generation of large quantities of data, 
performance is a critical issue. In a general purpose operating system, 
system performance... 44 will query the drivers known in order to then 
make interconnections according to data format and connection format to 
effectuate the rendering entirely in kernel mode . 
Furthermore, the controlling agent will receive notifications of 
important events so that it may exercise control as necessary. 
Examples of such events would include end of processing, a data 
starvation situation, etc. 

In this configuration... discussion purposes hereinafter transitioning 
to the acquire state will accomplish the same purpose with respect to 
stack depth parameter adjustment as transitioning out of the stop 
state. 

Once all pin instances are in the acquire state, stream reads and 
writes may be issued to the filter graph. It is interesting to note that 
the system explained herein allows connection of associated input... 
output pin instance 208 to finalize the connection. 

At step 197, the third party controlling agent 170 will transition each 
connection pin instance from the stop state to the acquire state in 
preparation for streamed data processing through the filter graph. 
To correctly set the stack depth parameter in each of the device objects 
for the respective filters, it is... 

...CLAIMS a plurality of kernel mode data processing components including 
an originating component and a terminating component, the originating 
component reading data samples of a data stream from the data 
source; and 

kernel mode component interconnections between the data 
processing components to route the data samples from the originating 
component to the terminating component. 

1 1 . A data processing system... 



18/3,K/10 (Item 10 from file: 348) 
DIALOG(R)File 348:EUROPEAN PATENTS 
(c) 2007 European Patent Office. All its. reserv. 

00957724 

Method and computer program product for synchronizing the processing of 
multiple data streams and matching disparate processing rates using a 
standardized clock mechanism 

PATENT ASSIGNEE: 
MICROSOFT CORPORATION, (749861), One Microsoft Way, Redmond, Washington 
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98052-6399, (US), (Applicant designated States: all) 
INVENTOR: 

Shaw, George H.J., 21213 NE 186th Street, Woodinville, WA 98072, (US) 

O'Rourke, Thomas J., Riekonmarjatie 29, 90800 Oulu, (FI) 

Woodruff, Bryan A., 1020 SW 10th Street, New Bend, WA 98045, (US) 
LEGAL REPRESENTATIVE: 

Alton, Andrew et al (97091), Urquhart-Dykes & Lord Tower House Merrion 
Way, Leeds LS2 8PA, (GB) 
PATENT (CC, No, Kind, Date): EP 869443 A2 981007 (Basic) 

EP 869443 A3 010509 
APPLICATION (CC, No, Date): EP 97304294 970619; 
PRIORITY (CC, No, Date): US 826560 970404 
DESIGNATED STATES: DE; FR; GB 
EXTENDED DESIGNATED STATES: AL; LT; LV; RO; SI 
INTERNATIONAL PATENT CLASS (V7): G06F-017/30 
ABSTRACT WORD COUNT: 227 
NOTE: 

Figure number on first page: 2 

LANGUAGE (Publication,Procedural,Application): English; English; English 

FULLTEXT AVAILABILITY: 

Available Text Language Update Word Count 

CLAIMS A (English) 9841 1278 

SPEC A (English) 9841 21034 
Total word count - document A 22312 
Total word count - document B 0 
Total word count - documents A + B 223 12 



...SPECIFICATION stream. 

Another method of synchronization utilizes a clock mechanism that 
provides a time value that is based on presentation timestamp values on a 
"master" data stream while the stream is being processed. 
When the master stream is paused or otherwise stopped , the 
time value alternates to be based off a physical or hardware clock such 
as the PC clock. While the processor of a second data... 

...on the physical clock. 

Since event notification is based on the physical clock, such events 
may continue to occur even when processing is paused or stopped on 
the master data or media stream . Notifications may be temporarily 
disabled but at the cost of extra processing overhead and increased 
processing code complexity. Even when notifications are specifically 
disabled, timing easily using kernel mode drivers, used throughout this 
application, comprises graph building functionality that allows a user to 
select and connect together different processing blocks , called 
filters, to successively manipulate a stream of multimedia data. 
The data typically is a series of samples representing sound or video and 
the processing blocks may include decompression processing for compressed 
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...so that the graph builder portion of the program may interconnect and 
control their operation and be responsive to user input and rearrangement 
of processing blocks . Because of the consistent stream nature 
of multimedia data and the generation of large quantities of data, 
performance is a critical issue. In a general purpose operating system, 
system performance... 

...clock mechanism is provided. The clock mechanism can be implemented in a 
variety of different ways and is explained throughout in the context of 
interconnectable kernel mode drivers providing efficient data 
stream processing. 

The clock mechanism of the present invention maintains three different 
time values and makes them available to other components. The time values 
maybe... 

...can be used by other processing components processing other data streams 
for synchronization. Since the positional time value is based on 
processing of a data stream , when the processing of the underlying 
data stream stops the clock does as well. Furthermore, timing 
event notifications will also be frozen on an absolute time basis, but 
will continue to be perfectly timed... showing the operation of buffer 
allocators and the actual data transferring between buffers for the 
system shown in Figure 12. 

Figure 14 is a logical block diagram illustrating how two live 
data streams can be synchronized to a single master clock 
mechanism. 

Figures 15A and 15B are logical block diagrams showing the live 
audio system of Figure 14 as implemented using the interconnected filter 
system explained in detail in connection with Figure 12. Figure.. .44 will 
query the drivers known in order to then make interconnections according 
to data format and connection format to effectuate the rendering entirely 
in kernel mode . Furthermore, the controlling agent will 
receive notifications of important events so that it may exercise 
control as necessary. Examples of such events would include end of 
processing, a data starvation situation, etc. 

In this configuration... discussion purposes hereinafter transitioning 
to the acquire state will accomplish the same purpose with respect to 
stack depth parameter adjustment as transitioning out of the stop 
state. 

Once all pin instances are in the acquire state, stream reads and 
writes may be issued to the filter graph. It is interesting to note that 
the system explained herein allows connection of associated input... 
output pin instance 208 to finalize the connection. 

At step 197, the third party controlling agent 170 will transition each 
connection pin instance from the stop state to the acquire state in 
preparation for streamed data processing through the filter graph. 
To correctly set the stack depth parameter in each of the device objects 
for the respective filters, it is... call (which may be done during a 
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DPC), the allocator returns a pointer to the frame if one is available or 
returns null immediately. The kernel mode requester may then 
wait for a free event notification to know that there is a free 
frame available. Upon receipt of this notification, the kernel mode 
requester will attempt the allocate request again.. .synchronized. 

In order to accomplish this, one data stream is chosen to be the 
master. For purposes of this disclosure and explanation, the live audio 
stream 298 has been arbitrarily chosen to be the master data 
stream . Note that nothing prevents the live video 
stream 308 from being chosen as the master or any other 
stream of data for that matter. Furthermore, an external clock, 
based on a data stream entirely external to the filter graph, may be used 
to synchronize based on the time interval information in the live audio 
stream 298, no overhead is necessary to manage stopped or 
paused states since media time advances ("ticks") only when the live 
audio stream 298 is being processed at the audio Tenderer 296. 

It should be noted that the video stream 308 may also need to be rate 
matched... instances and output pin instances in order to make the 
interconnections between the respective filters and it is the controlling 
agent that makes the entire kernel mode filter graph 
topology. 

The live audio stream 352 is initially brought into the filter 
graph topology by the audio reader filter 354 as represented by arrow 
356. The data is decompressed by... architecture while allowing unbounded 
ability for the filter developer to customize and add additional 
capabilities. 

While a property set allows other processing components, such as 
kernel mode filters, to query relevant time properties, clock 
mechanisms may also support an event set so that notifications may 
be sent directly to interested clock mechanism clients by way of event 
notification IRPs. The handle of a file object... 

...clock mechanism will know where to send event notifications. Optionally, 
direct function call handles may be used so that DPC level processing can 
occur between kernel mode entities for achieving higher 
performance. 

Table 5 below, a number of possible notification events make up 
an event set that can be supported by compliant drivers. A third party 
controlling agent may interconnect or register those entities for 
receiving... 

...336 of the video Tenderer filter 332 as a slave processing component. 
Note also, that the media time value used depends upon the master data 
stream processing. Should media rendering be paused or 
stopped , then time would effectively stop for synchronization 
purposes. 

The slave processing component or filter will compare the master clock 
media time value to the slave data stream media time value... the local 
host time rather than the PC time. 
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Applying the method for translation illustrated in the flow chart of 
Figure 16C to the interconnected kernel mode filters for 
processing and rendering a live audio and video stream as shown in 
Figures 1 5 A and 1 5B, it would be the input pin instance 336 on the video 
renderer filter 332 that would receive the... 

...CLAIMS A method as claimed in claim 17, in which the designated time 

value is a physical time value based upon a hardware oscillator. 
20. A kernel mode data processing system allowing synchronous 

processing of two data streams comprising: 
a first data source; 

a first plurality of kernel mode data processing components 
including an originating component and a terminating component, the 
originating component reading data samples of a first data stream 
originating from the... 
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...SPECIFICATION mapped. However, other associativities are possible. 
The invention also provides a computer system comprising: 
a processor for running a process by executing a sequence of 
instructions ; 

a main memory which holds said instructions and data for 
said instructions ; and 

a cache connected in a memory access path between the processor and 
the main memory and having a plurality of storage locations, wherein a... 

...in main memory are held at that location in the cache. 

In the preferred embodiment, the processor has a user mode of operation 
and a privileged (kernel) mode of operation. Cache 
coherency instructions are executable in the user mode. 

For a better understanding of the present invention and to show how the 
same may be carried into effect... have their partition indicator bits set 
for that partition. However, a user may alter partitions by requesting 
that the cache partitions be altered. In that event , the CPU 2 
would change to kernel mode to implement the request, change 
the TLB entries accordingly and then return to the user mode to allow the 
user to continue. Thus, a user.. .single cache access circuit which 
performs both look-up and refill. 

As can be seen in Figures 2 and 8, each location in the cache 
holds an address in main memory and the item (data or 
instruction) from that address in main memory. It is not necessary 
for the whole of the memory address to be held at the cache location. For 



...CLAIMS or each cache partition, when present, is direct mapped. 

10. A computer system comprising: 

a processor for running a process by executing a sequence of 
instructions ; 

a main memory which holds said instructions and data for 

said instructions ; and 
a cache connected in a memory access path between the processor and the 

main memory and having a plurality of storage locations, wherein a... 

...at that location in the cache. 

1 1 . A computer system according to claim 10, wherein the processor has a 
user mode of operation and a privileged mode of operation 

and wherein the cache coherency instruction is executed in the 
user mode. 

12. A method of modifying the coherency status of the contents of a cache 
with respect to items held... 
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s ...SPECIFICATION local bus are connected to pipeline and bus control 
circuitry 87, driven from the internal bus structure 81 and the control 
logic 82. 

The microprocessor block 75 in the processor 40 is of the RISC 
type in that most instructions execute in one machine cycle, and 
the instruction set uses register-to-register and load/store instructions 
rather than having complex instructions involving memory references... 
more difficult, and would degrade performance; also, use of the write 
buffer 50 serves to decouple the processors, and would be much less 
effective with close coupling of the processors. Likewise, the 
high-performance resulting from using instruction and data caches, 
and virtual memory management with the TLBs 83, would be more difficult 
to implement if close coupling were used, and performance would... 
provided in many machines. The processor 40 is configured to operate 
normally in the user mode until an exception is detected forcing it into 
the kernel mode , where it remains until a restore from 
exception (RFE) instruction is executed. The manner in which the 
memory addresses are translated or mapped depends upon the operating mode 
of the microprocessor, which is defined by... 

...The kuseg 150 segment for the kernel mode is 2-GByte in size, coincident 
with the "kuseg" of the user mode, so when in the kernel mode the 
processor treats references to this segment just like user mode 
references, thus streamlining kernel access to user data. The kuseg 
150 is used to hold user code and data, but the operating system 
often needs to reference this same code or data. The ksegO area 152 is a 
512-MByte... 

...This kseg2 area differs from the kuseg area 1 50 only in that it is not 
accessible in the user mode, but instead only in the kernel 
mode . The operating system uses kseg2 for stacks and per-process 
data that must remap on context switches , for user page 
tables (memory map), and for somedynamically-allocated data areas. Kseg2 
allows selective caching and mapping on a per page basis, rather than... 
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...SPECIFICATION local bus are connected to pipeline and bus control 
circuitry 87, driven from the internal bus structure 81 and the control 
logic 82. 

The microprocessor block 75 in the processor 40 is of the RISC 
type in that most instructions execute in one machine cycle, and 
the instruction set uses register-to-register and load/store instructions 
rather than having complex instructions involving memory references- 
more difficult, and would degrade performance; also, use of the write 
buffer 50 serves to decouple the processors, and would be much less 
effective with close coupling of the processors. Likewise, the 
high-performance resulting from using instruction and data caches, 
and virtual memory management with the TLBs 83, would be more difficult 
to implement if close coupling were used, and performance would., .then 
presents the interrupt to the processor chip 40. The result is that all 
of the processor chips received the external interrupt request at 
identical instructions , and the information saved in the 
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holding registers is not needed. 
Holding Register: 

In the interrupt scenario presented above with reference to Figure 15, 
the voter 136 uses a holding register 138 to save some state information 
...provided in many machines. The processor 40 is configured to operate 
normally in the user mode until an exception is detected forcing it into 
the kernel mode , where it remains until a restore from 
exception (RFE) instruction is executed. The manner in which the 
memory addresses are translated or mapped depends upon the operating mode 
of the microprocessor, which is defined by... 

...The kuseg 150 segment for the kernel mode is 2-GByte in size, coincident 
with the "kuseg" of the user mode, so when in the kernel mode the 
processor treats references to this segment just like user mode 
references, thus streamlining kernel access to user data. The kuseg 
1 50 is used to hold user code and data, but the operating system 
often needs to reference this same ...This kseg2 area differs from the 
kuseg area 150 only in that it is not accessible in the user mode, but 
instead only in the kernel mode . The operating system uses 
kseg2 for stacks and per-process data that must remap on context 
switches , for user page tables (memory map), and for some 
dynamically-allocated data areas. Kseg2 allows selective caching and 
mapping on a per page basis, rather... 



1 8/3,K/14 (Item 14 from file: 348) 
DIALOG(R)File 348:EUROPEAN PATENTS 
(c) 2007 European Patent Office. All rts. reserv. 

00426745 

Fault-tolerant computer system with/config filesystem. 

Fehlertolerantes Rechnersystem mit Online-Wiedereinfugung und 

Abschaltung/Start. 
Systeme de calculateur tolerant les fautes avec reintegration et 

arret-d'urgence/reprise en ligne. 
PATENT ASSIGNEE: 
TANDEM COMPUTERS INCORPORATED, (524031), 10435 N. Tantau Avenue, 

Cupertino, California 95014-0709, (US), (applicant designated states: 

AT;BE;CH;DE;DK;ES;FR;GB;GR;IT;LI;LU;NL;SE) 
INVENTOR: 

Jewett, Douglas E., 12401 Wycliff Lane, Austin, Texas 78727, (US) 

Bereiter, Tom, 4613 Shoalwood, Austin, Texas 78756, (US) 

Vetter, Brian, 8501 Potosi Cove, Austin, Texas 78717, (US) 

Banton, Randall G., 1805 Stoneridge Road, Austin, Texas 78746, (US) 

Cutts, Richard W., Jr., 1312 Elm Street, Georgetown, Texas 78626, (US) 

Westbrook, Donald C, , deceased, (US) 

Fey, Kyran W., Jr., 507 Hickory Ridge Trail, Pfluggerville, Texas 78660, 
(US) 

Pozdro, John, 4617 Sinclair Avenue, Austin, Texas 78756, (US) 



Page 22-80 10/643769 
Debacker, Kenneth C, 1 5702 Monona Cove, Austin, Texas 787 1 7, (US) 
Mehta, Nikhil A., 1715 Prairie Hen Cove, Austin, Texas 78752, (US) 
Webster, Phil, 9743 Anderson Village Dr., Austin, Texas 78758, (US) 
Aldridge, Dave, 20542 Highland Lake Drive, Lage Vista, Texas 78645, (US) 
Norwood, Peter C, 2200 Klattenhoff Dr., Austin, Texas 78728, (US) 

LEGAL REPRESENTATIVE: 
Liesegang, Roland, Dr.-Ing. et al (7741), FORRESTER & BOEHMERT 
Franz-Joseph-Strasse 38, W-8000 Munchen 40, (DE) 

PATENT (CC, No, Kind, Date): EP 433979 A2 910626 (Basic) 
EP 433979 A3 930526 

APPLICATION (CC, No, Date): EP 90124582 901218; 

PRIORITY (CC, No, Date): US 455065 891222; US 455127 891222; US 455218 

891222; US 461250 900105; US 461402 900105 
DESIGNATED STATES: AT; BE; CH; DE; DK; ES; FR; GB; GR; IT; LI; LU; NL; SE 
INTERNATIONAL PATENT CLASS (V7): G06F-011/16; G06F-011/14; 
ABSTRACT WORD COUNT: 356 

LANGUAGE (Publication,Procedural, Application): English; English; English 

EULLTEXT AVAILABILITY: 

Available Text Language Update Word Count 

CLAIMS A (English) EPABF 1 43 1 7 

SPEC A (English) EPABF 1 29454 
Total word count - document A 3377 1 
Total word count - document B 0 
Total word count - documents A + B 33771 

...SPECIFICATION local bus are connected to pipeline and bus control 
circuitry 87, driven from the internal bus structure 81 and the control 
logic 82. 

The microprocessor block 75 in the processor 40 is of the RISC 
type in that most instructions execute in one machine cycle, and 
the instruction set uses register-to-register and load/store instructions 
rather than having complex instructions involving memory references... 
more difficult, and would degrade performance; also, use of the write 
buffer 50 serves to decouple the processors, and would be much less 
effective with close coupling of the processors. Likewise, the 
high-performance resulting from using instruction and data caches, 
and virtual memory management with the TLBs 83, would be more difficult 
to implement if close coupling were used, and performance would... 
provided in many machines. The processor 40 is configured to operate 
normally in the user mode until an exception is detected forcing it into 
the kernel mode , where it remains until a restore from 
exception (RFE) instruction is executed. The manner in which the 
memory addresses are translated or mapped depends upon the operating mode 
of the microprocessor, which is defined by... The kuseg 150 segment for 
the kernel mode is 2-GByte in size, coincident with the "kuseg" of the 
user mode, so when in the kernel mode the processor treats 
references to this segment just like user mode references, thus 
streamlining kernel access to user data. The kuseg 150 is used to 
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. hold user code and data, but the operating system often needs to 
reference this same code or data. The ksegO area 152 is a 512-MByte.. 
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...This kseg2 area differs from the kuseg area 150 only in that it is not 
accessible in the user mode, but instead only in the kernel 
mode . The operating system uses kseg2 for stacks and per-process 
data that must remap on context switches , for user page 
tables (memory map), and for some dynamically-allocated data areas. Kseg2 
allows selective caching and mapping on a per page basis, rather... 
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...SPECIFICATION local bus are connected to pipeline and bus control 
circuitry 87, driven from the internal bus structure 81 and the control 
logic 82. 

The microprocessor block 75 in the processor 40 is of the RISC 
type in that most instructions execute in one machine cycle, and 
the instruction set uses register-to-register and load/store 
instructions rather than having complex instructions involving memory 
...more difficult, and would degrade performance; also, use of the write 
buffer 50 serves to decouple the processors, and would be much less 
effective with close coupling of the processors. Likewise, the 
high-performance resulting from using instruction and data caches, 
and virtual memory management with the TLBs 83, would be more difficult 
to implement if close coupling were used, and performance would... then 
presents the interrupt to the processor chip 40. The result is that all 
of the processor chips received the external interrupt request at 
identical instructions , and the information saved in the 
holding registers is not needed. 
Holding Register: 

In the interrupt scenario presented above with reference to Figure 15, 
the voter 136 uses a holding register 138 to save some state information 
...provided in many machines. The processor 40 is configured to operate 
normally in the user mode until an exception is detected forcing it into 
the kernel mode , where it remains until a restore from 
exception (RFE) instruction is executed. The manner in which the 
memory addresses are translated or mapped depends upon the operating mode 
of the microprocessor, which is defined by... 

...The kuseg 150 segment for the kernel mode is 2-GByte in size, coincident 
with the "kuseg" of the user mode, so when in the kernel mode the 
processor treats references to this segment just like user mode 
references, thus streamlining kernel access to user data. The kuseg 
150 is used to hold user code and data, but the operating system 
often needs to reference this same code or data. The ksegO area 152 is a 
512-MByte...This kseg2 area differs from the kuseg area 150 only in that 
it is not accessible in the user mode, but instead only in the 
kernel mode . The operating system uses kseg2 for stacks and 
per-process data that must remap on context switches , for 
user page tables (memory map), and for some dynamically-allocated data 
areas. Kseg2 allows selective caching and mapping on a per page basis, 
rather... 



Page 25-80 



10/643769 



.18/3,K/16 (Item 16 from file: 348) 
DIALOG(R)File 348:EUROPEAN PATENTS 
(c) 2007 European Patent Office. All rts. reserv. 

00355847 

High-performance computer system with fault-tolerant capability; method for 

operating such a system 
Hochleistungsrechnersystem mit fehlertoleranter Fahigkeit; Verfahren zum 

Betrieb desselben 

Systeme de calculateur a haute performance a capacite de tolerance de 

fautes; methode pour operer un tel systeme 
PATENT ASSIGNEE: 

TANDEM COMPUTERS INCORPORATED, (524031), 10435 N. Tantau Avenue, 
Cupertino, California 95014-0709, (US), (applicant designated states: 
AT;DE;FR;GB;IT;SE) 
INVENTOR: 

Cutts, Richard W., Jr., 1312 Elm Street, Georgetown Texas 78626, (US) 
Banton, Randall G, 1805 Stoneridge Road, Austin Texas 78746, (US) 
Jewett, Douglas E., 12401 Wycliff Lane, Austin Texas 78727, (US) 
Norwood, Peter C, 2200 Klattenhoff Drive, Austin Texas 78728, (US) 
Debacker, Kenneth C, 15702 Monona Cove, Austin Texas 78717, (US) 
Mehta, Nikhil A., 1715 Prairie Hen Cove, Austin Texas 78758, (US) 
Allison, John David, 1406 Windsor Road, Nr. 202, Austin Texas 78703, (US) 
Horst, Robert W., 2804 Robeson Park Drive, Champaign Illinois 61821, (US) 

LEGAL REPRESENTATIVE: 
Altenburg, Udo, Dipl.-Phys. et al (1268), Patent- und Rechtsanwalte, 
Bardehle . Pagenberg . Dost . Altenburg . Frohwitter . Geissler & 
Partner, Galileiplatz 1, 81679 Munchen, (DE) 

PATENT (CC, No, Kind, Date): EP 372579 A2 900613 (Basic) 
EP 372579 A3 910724 
EP 372579 Bl 971001 

APPLICATION (CC, No, Date): EP 89122708 891208; 

PRIORITY (CC, No, Date): US 282629 881209; US 283574 881213 

DESIGNATED STATES: AT; DE; FR; GB; IT; SE 

INTERNATIONAL PATENT CLASS (V7): G06F-01 1/16; G06F-01 1/18; 

ABSTRACT WORD COUNT: 199 

LANGUAGE (Publication,Procedural,Application): English; English; English 

FULLTEXT AVAILABILITY: 

Available Text Language Update Word Count 

CLAIMS B (English) 9709W4 920 

CLAIMS B (German) 9709W4 851 

CLAIMS B (French) 9709W4 991 

SPECB (English) 9709W4 20185 
Total word count - document A 0 
Total word count - document B 22947 
Total word count - documents A + B 22947 

...SPECIFICATION local bus are connected to pipeline and bus control 
circuitry 87, driven from the internal bus structure 81 and the control 
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logic 82. 

The microprocessor block 75 in the processor 40 is of the RISC 
type in that most instructions execute in one machine cycle, and 
the instruction set uses register-to-register and load/store instructions 
rather than having complex instructions involving memory references... 
more difficult, and would degrade performance; also, use of the write 
buffer 50 serves to decouple the processors, and would be much less 
effective with close coupling of the processors. Likewise, the 
high-performance resulting from using instruction and data caches, 
and virtual memory management with the TLBs 83, would be more difficult 
to implement if close coupling were used, and performance would... then 
presents the interrupt to the processor chip 40. The result is that all 
of the processor chips received the external interrupt request at 
identical instructions, and the information saved in the 
holding registers is not needed. 

Holding Register: 

In the interrupt scenario presented above with reference to Figure 15, 
the voter 136 uses a holding register 138 to save some state information 
...provided in many machines. The processor 40 is configured to operate 
normally in the user mode until an exception is detected forcing it into 
the kernel mode , where it remains until a restore from 
exception (RFE) instruction is executed. The manner in which the 
memory addresses are translated or mapped depends upon the operating mode 
of the microprocessor, which is defined by... 

...The kuseg 150 segment for the kernel mode is 2-GByte in size, coincident 
with the "kuseg" of the user mode, so when in the kernel mode the 
processor treats references to this segment just like user mode 
references, thus streamlining kernel access to user data. The kuseg 
150 is used to hold user code and data, but the operating system 
often needs to reference this same code or data. The ksegO area 152 is a 
512-MByte kernel mode . The operating system uses kseg2 for 
stacks and per-process data that must remap on context 
switches , for user page tables (memory map), and for some 
dynamically-allocated data areas. Kseg2 allows selective caching and 
mapping on a per page basis, rather... 
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...SPECIFICATION The kuseg 150 segment for the kernel mode is 2-GByte in 
size, coincident with the "kuseg" of the user mode, so when in the 
kernel mode the processor treats references to this segment just 
like user mode references, thus streamlining kernel access to user 
data. The kuseg 1 50 is used to hold user code and data, but the 
operating system often needs to reference this same code or data. The 
ksegO area 152 is a 512-MByte... 

...This kseg2 area differs from the kuseg area 150 only in that it is not 
accessible in the user mode, but instead only in the kernel 
mode . The operating system uses kseg2 for stacks and per-process 
data that must remap on context switches, for user page > 
tables (memory map), and for some dynamically-allocated data areas. Kseg2 
allows selective caching and mapping on a per page basis, rather... 
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...SPECIFICATION This is for a separate cache for instruction buffering. It 
is organized and implemented differently from the unified cache. It is 
organized as a 4-block, 1-set associative cache for storing eight 
instructions in each of its four locations. It is implemented in 
static RAM cells for maximum speed. In contrast, the unified cache is 
implemented in dynamic... 

» 

...of the same properties as a micro routine in a microcoded machine. Micro 
mode is normally entered as a result of execution of a predetermined 
instruction in the instruction repertory (an MCALL 
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instruction) , or event . 

User mode and kernel mode are similar to the nonprivileged 
and privileged modes of most conventional processors. User 
mode is used to execute application code, kernel mode to execute the 
majority of the kernel. 

IP 101 has a... the Special Cache is used to cache instructions and data 
needed to manage TC loads. For both the IB cache and the unified cache, 
cache block size (or location width) is 64 bytes/8 
instructions/ 16 PTEs. The IB has 1x4 cache blocks ; the IDC 
has 64x2 cache blocks ; the TC and SC each have 8x2 blocks . 
During execution of a store instruction , data can be stored into a 
location in the IDC. This is not the case for the IB, TC, and SC. 

The IDC, TC, and.. .calculate the address of and load the PTE mapping 
the virtual page number; restore registers saved above; exit from T-mode 
by executing an MRET instruction . 

Loading the PTE mapping the virtual address causes the block of 
16 PTEs containing the PTE to be loaded into the translation cache. This 
requires that PTEs be stored in the first 256 megabytes of... 
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..ABSTRACT sequences as a atomic operation, the nonmicrocoded data 
processing system is provided with a third mode, in addition to the 
(nonprivileged) user mode and the (privileged) kernel 
mode , of operation that permits the execution of instruction 
sequences with interrupting events disabled and certain functions 
and apparatus enabled to facilitate instruction sequence execution. 

...SPECIFICATION and nonmicrocoded data processing systems. 

It is yet another feature of the present invention to provide a data 
processing system operating mode for execution of instructions in 
addition to the usual nonprivileged mode and privileged mode. 

It is still another feature of the present invention to permit 
execution of preselected groups of instructions in an atomic manner 
without interruption. 

It is... 

...THE INVENTION 
The aforementioned and other features are accomplished, according to 
the present invention, by providing a data processing system with three 
modes for executing instruction . The user mode, in which 
application programs are generally executed, and the kernel 
mode , in which privileged instructions are typically 
executed, are the usual modes of a data processing system. To these 
modes, the present invention adds a third mode, hereinafter referred to 



...is a diagrammatic illustration of the relationship of the data 
processing system operating modes. 
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Figure 5 illustrates the steps in transferring from the user or 
kernel mode to the EPICODE mode. 

Figure 6 illustrates that the EPICODE instruction sequences 
are stored in reserved areas of the main memory unit. 
DESCRIPTION OF THE PREFERRED EMBODIMENT 
1. Detailed Description of the Figures 

Referring now to.. .control in order to obtain the desired processing 
capabilities. The instructions are typically nonprivileged in the sense 
that the order and selected aspects of the instruction are under 
control of the user. The kernel mode 4B is the mode in which 
the operating system executes instructions . The kernel 
mode executes all instructions available in the user mode as 
well as additional instructions associated with the kernel 
mode 4B that are privileged and therefore are not available for 
manipulation by a user. Privileged instructions are not allowed in 
user mode because they could compromise the security of other users or 
programs. The EPICODE mode of data processing system operation... 

...reserved for instruction sequences that should execute without 
interruption and/or should not execute unless the data processing system 
is in a predetermined state. Some instructions that can be executed 
in user mode 4A or in kernel mode 4B require a transition 
into the EPICODE mode 4C. This mode is provided with certain privileges 
and certain dedicated hardware implementing the strategy to ensure... 

...instruction in the EPICODE format communicates to the data processing 
system the requirement to enter the EPICODE mode. In step 502, the issue 
unit is prevented from issuing new instructions , but the 
instructions for which execution has begun are completed. The 
completion of currently executing instructions permits all hardware 
exceptions to be signaled prior to execution in the... expedited, there 
being no need to transfer individual data and instruction elements. In 
addition, programs are generally written in a format that stores data 
and instruction elements needed for sequential instruction 
execution relatively close together in the program or file. Thus, a 
page of data and instruction elements will typically include a 
multiplicity of related data and instruction elements for program 
execution. None-the-less, the relative rigidity of the granularity of... 

...SPECIFICATION and nonmicrocoded data processing systems. 

It is yet another feature of the present invention to provide a data 
processing system operating mode for execution of instructions in 
addition to the usual nonprivileged mode and privileged mode 

It is still another feature of the present invention to permit 
execution of preselected groups of instructions in an atomic manner 
without interruption. 

It is... 
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...THE INVENTION 

The aforementioned and other features are accomplished, according to 
the present invention, by providing a data processing system with three 
modes for executing instruction . The user mode, in which 
application programs are generally executed, and the kernel 
mode , in which privileged instructions are typically 
executed, are the usual modes of a data processing system. To these 
modes, the present invention adds a third mode, hereinafter referred to 



..is a diagrammatic illustration of the relationship of the data 
processing system operating modes. 

Figure 5 illustrates the steps in transferring from the user or 
kernel mode to the EPICODE mode. 

Figure 6 illustrates that the EPICODE instruction sequences are 
stored in reserved areas of the main memory unit. 

DESCRIPTION OF THE PREFERRED EMBODIMENT 

1. Detailed Description ...control in order to obtain the desired 

processing capabilities. The instructions are typically nonprivileged 

in the sense that the order and selected aspects of the 

instruction are under control of the user. The kernel 

mode 4B is the mode in which the operating system executes 

instructions . The kernel mode executes all 

instructions available in the user mode as well as additional 

instructions associated with the kernel mode 4B that 

are privileged and therefore are not available for manipulation by a 

user. Privileged instructions are not allowed in user mode because 

they could compromise the security of other users or programs. The 

EPICODE mode of data processing system operation... 

...reserved for instruction sequences that should execute without 
interruption and/or should not execute unless the data processing system 
is in a predetermined state. Some instructions that can be 
executed in user mode 4A or in kernel mode 4B require a 
transition into the EPICODE mode 4C. This mode is provided with certain 
privileges and certain dedicated hardware implementing the strategy to 
ensure... 

...instruction in the EPICODE format communicates to the data processing 
system the requirement to enter the EPICODE mode. In step 502, the 
issue unit is prevented from issuing new instructions , but 
the instructions for which execution has begun are completed. The 
completion of currently executing instructions permits all hardware 
exceptions to be signaled prior to execution in the... expedited, there 
being no need to transfer individual data and instruction elements. In 
addition, programs are generally written in a format that stores data 
and instruction elements needed for sequential instruction 
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execution relatively close together in the program or file. Thus, 
a page of data and instruction elements will typically include a 
multiplicity of related data and instruction elements for program 
execution. None-the-less, the relative rigidity of the granularity of 



...CLAIMS Bl 

1 . A data processor (11), which executes instructions in a user 
mode (4A) of operation and a kernel mode (4B) of 
operation, for processing programs and for servicing interrupt 
requests that are applied to said processor, characterized in that 
said data processor comprises: 
processing... 

...wherein the control means (22) enables the processing means to execute 
non-privileged instructions when operating in the user mode (4 A), 
privileged or non-privileged instructions when operating in 
the kernel mode (4B), and non-interruptible sequences of 
instructions when operating in the third mode (4C). 
3. The data processor of Claim 2 wherein said non-interruptible 
sequences of instructions include special instructions in... 

...in said third mode. 

17. The data processor of Claim 1 wherein said control means enables 
said processing means to execute a plurality of overlapping 
instructions in said user mode and said kernel 

mode , said control means enabling said processing means to 
complete execution (502) of all currently executing 
instructions in said user and kernel mode prior to 
entering said third mode. 

1 8. A method of operating a data processor (11), which executes 
instructions in a user mode (4A) of operation and a 

kernel mode (4B) of operation, characterized in that 
said method comprises the steps of: 

enabling said processor to perform processing operations in at least 
three privilege modes... not include microcode for performing said 
operations that are processed in said third mode. 
21 . A method of operating a data processor (11) which executes 
instructions in a user mode (4A) of operation and a 
kernel mode (4B) of operation, characterized in that 
said method comprises the steps of: 

allowing said processor to temporarily suspend executing said 
instructions and service interrupt requests when said processor 
is operating in said user mode or said kernel mode ; and 

causing said processor to respond to a predetermined instruction 
executed in said user mode by executing a selected sequence of 
instructions (509), and inhibiting (503) said processor from 
suspending execution of said sequence and... 
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...SPECIFICATION the invention. 
Typically, the 3B15 system uses two modes, or levels, of operation of 
the WE 321 00 microprocessor: a user mode for executing user program 
instructions , and a privileged mode for executing 
functions such as operating system instructions that have the 
potential for corrupting shared system resources such as hardware 
resources. There are two mechanisms for entering privileged 
mode from user mode: a process switch mechanism and a system 
call mechanism. The system call mechanism is also known 
by names such as a supervisory call and an operating system trap. 
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...provides a means of controlled entry into a function by installing on a 
processor a new processor status word (PSW) and program counter (PC). The 
system call mechanism is used by explicit operating 
system GATE calls to transfer control from user to 
privileged mode , and is also used to handle "normal" system 
exceptions. ("Quick" interrupts, which would also be handled by this 
mechanism, are not used by the UNIX... 

...exception is an error condition-a fault or a trap-other than an 
interrupt. Normal exceptions constitute most system exceptions. The " 
normal exception handlers are privileged - mode functions. 

The system call mechanism uses the execution stack of the 
present process; that is, a normal exception handler or a function called 
via a GATE instruction uses for... 

...privileged state will not become corrupted. Therefore, prior to making a 
legal entry of a privileged function, i.e., before executing a transfer 
to the privileged mode upon the occurrence of a normal 
exception or a GATE request, the system call mechanism checks 
the present execution stack pointer value against the execution stack 
boundary values that are stored in the process control block of the 
presently... processor 12, or execution of the partially-executed 
instruction is merely completed on master processor 12. Unless the 
condition that caused the attempt to enter privileged mode 
was a transient fault, execution of that instruction on master 
processor 12 results in the invocation of the operating system service. 
That service is then provided in a conventional, uniprocessor, manner on 
master... the physical address of the slave's initial process control 
block). Also at step 502, a process switch to the slave's initial process 
control block is performed by the conventional RETPS 
instruction . As part of this process switch, the slave's process 
control block pointer register is set to the physical address of 
the intermediate portion of the slave's initial process control block, 
the initial program counter is... 

...step 503, which resets the process control block pointer register to the 
virtual address of the intermediate portion of the slave's initial 
process control block , and executes the conventional ENBVJMP 
instruction which enables virtual addressing for slave processor 25 
and transfers to the virtual address contained in register rO. 
The slave virtual-mode startup routine is. ..This is implemented by 
calling a slave delete routine (see FIG. 15) , at step 1002. 
Illustratively in this example, because the address of the faulting 
instruction is already saved in the process control block as 
result of stack exception handling, the faulting instruction is 
reexecuted once the process restarts execution on master processor 12. 
This feature is fairly typical of systems with demand-paged memory 
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..routine (see FIG. 14) must be invoked to avoid restart problems. This is 
illustratively accomplished, at step 1001, by removing the address of the 
faulting instruction from the process control block and 
substituting therefor the starting address of the instruction 
restart routine, and storing the removed faulting instruction's address 
in a variable. When the instruction restart routine completes, it 
restores the faulting instruction 's address in the process control 
block . 

Although far less usual than normal exceptions, another exception 
condition is possible when executing user-mode processes: system error. 
This category includes things like alignment... 

...as the standard, master, stack exception handler deals with MAU restart 
problems. The instruction restart routine then restores the address of 
the user process 1 faulted instruction to the process control 
block of the faulted process to cause execution of the faulted 
instruction , at step 1203, and returns at step 1204 to execution of 
that instruction. 

FIG. 15 flowcharts the slave delete routine. This routine is invoked on 



...CLAIMS indicator means associated with the occurred event. 

12. The system of claim 1 1 in a system wherein a change of execution mode 
comprises entry of privileged execution mode : wherein 

the plurality of events comprise interrupts; wherein 
the indicators comprise interrupt vectors: and wherein 
the functions identified by the indicators comprise interrupt 
handlers executable in privileged mode. 

13. The... 
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Claims 
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Detailed Description 

... primitive is an object that can be used to synchronize multiple 
threads' access to shared resources, such as critical sections, mutexes, 
semaphores or events. 

A thread is an executable entity that comprises a program counter, 
a user-mode stack, a kernel - mode stack and a set of register 
values. 

A token stream is a stream of hardware-independent tokens 

that describe a series of drawing operations. A token stream can be 

translated by a hardware-specific software component, such as... 

inefficient and generally does not scale well with multiple instances of 

a given piece of hardware in a system. As a rule, IN and OUT 

instructions can be executed only in kernel mode , the 

microprocessor mode that allows direct manipulation of hardware. If a 

user mode thread encounters ...application activity into 

DrawPrimitives2 ("DP2") tokens. When the DP2 token stream is submitted, a 

kernel transition occurs and the driver 314 translates the DP2 token 

stream into hardware-specific commands in kernel mode . 

Fig. 3B does not make any assumptions about whether the driver 314 is in 

user mode or kernel mode, and in this regard, driver component... that 

need to be written to this buffer, however, manually tracking this byte 

count WO 03/05843 1 PCT/US03/00396 

3 1 

surround the write instructions with a structured exception 

handling block , field the exception and return an error to the 

runtime if the command buffer overflows. The interface to the user mode 

driver would then be.. .commands' DDI counterparts into hardware-specific 
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commands and write them into a temporary buffer or buffers (since the 
exact size of the hardware-specific command stream is not known 
until after the translation has been performed). When the recording is 
stopped , or once the token stream has been parsed, the system 
could then allocate a command buffer or command buffers of suitable size 
and copy the translated hardware commands into them.. .could be used. The 
system of Fig. 7 closely resembles the state of the art in DIRECT3D(V 
driver structure, in that a DrawPrimitives2 token stream is passed 
to the kernel mode driver 705. The main difference is that 
the system of Fig. 7 contemplates OS-arbitrated scheduling of the command 
buffers, while DIRECT3D® currently does not... 

Claim 

... including preempting by the at least one coprocessor upon the 
occurrence of an external event. 

19 A method according to claim 18, wherein the external event is 
the operating system making a call to a corresponding 
kernel mode driver object to preempt the ...at least one 
coprocessor upon the occurrence of an external event. 

45 At least one computer readable medium according to claim 44, wherein 
the external event is the operating system making a 

call to a corresponding kernel mode driver object to 
preempt the at least one coprocessor. 

46 At least one computer readable medium according to claim 27, wherein 
the host processor is... preempted by the at least one coprocessor upon 

the occurrence of an external event. 

70 A computing device according to claim 69, wherein the external 
event is the operating system making a call to a 
corresponding kernel mode driver object to preempt the at 

least one coprocessor. 

71 A computing device according to claim 52, wherein the host processor 
is interrupted to coordinate... 
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... That is, the computer "freezes" up. Or to put it another way, the 
computer enters into an infinite loop. When this happens the computer 
system stops responding to the user's instructions and input. 

Then there are those arrows that are to be found ...data, in this case 
addressing values, travels from the "general purpose FIS Processor Unit" 
to the "Rest of general purpose FIS Computer." When one examines 
closely this particular data stream , what one finds is that 
...again to return to the use of the Kernel Addressing Systeni-retuming 
the general purpose FIS processor unit of claims (2) and (6) to the 
Kernel Mode . 

The second event that the "Fundamental Control Memory System" 
carries out is one that occurs whenever the system is going from the 
Kernel Addressing System to the Application... system has control of the 
"general purpose FIS Processor Unit", it then can make a determination. 
It can either allow the process that was 'ust stopped to run again 
for a given number of instructions . Or it can then allow another 
application, another process, to have its tarn at directing this said 
1 1 general purpose FIS Processor Unit." That is, the into the Kernel 
Mode after completing a certain number of instructions , is 
the most crucial of them. 

Function Four 

In the present processors built upon logic circuits, there are a large 
number of registers in the... program will, in general, operate in a 
multitasking environment or in a single application environment. 
Multitasking and Mimicry 

And as stated in claim (27), this instruction set that will be 
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applicable in both of these modes, the Kernel Mode and the 
Application Mode, can also be designed to serve another purpose; that of 
fitting as closely as possible with the instruction set, or 
instruction sets, that go along with the more popular, presently 
manufactured general purpose FIS microprocessors, hi doing this latter 
task, that of mimicking a present general-feedback Programmed Memory 
Systems". In particular, that group of "Primary bit-slice feedback 
Programmed Memory Systems" will be set up so as to mimic as closely 
as possible-and in some cases exactly-the various instruction sets 
of the various types of presently manufactured general purpose FIS 
microprocessors, as stated in claim (39). 

And to do this mimicry, it will be... on the first of the three data 
transfer subsystems of the "Data Input/Output Bus", two actions take 
place. 

The first is that the the " Hold " subsystem shown in fig. 2 stores 
the instruction value. The second 
54 of 70 

action is that the same instruction , now held by the "Hold" 
subsystem, is passed through the multiplexer ...a Data RAM system and 
the addressing value for that said Data RAM system has stepped forward by 
one. And what now happens in a block integer adder 
instruction is that the value stored in the above mentioned 
hold register/small memory circuit is reduced by one. 

Then if the value in the hold register/small memory circuit has reached 
the set ...integer addition can take place. 

And thus in this way the master controller carries out as many additions 
as the value first received by the hold register/small memory 
circuit when this block integer addition instruction was 
first begun. 

Chatige Addressingfor RAM 

The next function that the master controller must be able to accomplish 
is that of setting up addressing values... be sent to this new type of 
general processor. - Or put another way, any other type of move 
instruction other than the in-page move instructions will interfere 
with the smooth execution of a given application program. However in the 
kernel mode , the processor can change any of the addressing 
values for any of the four sets of addressing/accessing subsystems for 
any of the RAM or...must be noted that the use of this Twos Complement 
Unit, like the use of the integer Adder, can be run as part of a 
block instruction ; that is, a whole sequence of numbers can 
be converted to their twos complement in one long sequence, a sequence 
that will be coordinated out... to step forward by one and then send the 
next instruction to itself. At which point the master controller goes on 
to execute this next instruction . 
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The master controller will also be able to do block ANDs, ORs or 
XORs. 

Claim 

... be put to is to allow the general purpose FIS processor unit of 
claims ( 2) and ( 6) to keep track of how may times a block 
addition, block multiplication, block move and the like is 
repeated in a given instruction . 

63 that the various types of general purpose FIS processor units of 
claims ( 2) and 6) can be so designed so that two or more... 
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... the database software engine is responsible for execution of query 
plans generated by a query management portion of the database engine. The 
typical number of instructions per data unit for the data 
manipulation portion is close to 1 . As a result, a CPU capable of 
performing 15 (or 1 0 with SRAM) instructions in the time it takes to 
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fetch each... manipulation portion of the RDBMS. The DMS provides highly 
efficient execution of all data manipulation operations required for 
database processing upon multiple, high-bandwidth data streams 
flowing directly from a set of external storage devices which hold 
one or more databases. The DMS is attached directly to the external 
storage devices. 

A preferred embodiment of the architecture of the present invention is... 
1604 via a one-way data manipulation instruction signal path and a bi- 
directional control signal path. Note that in Fig. 16 single lines 
between blocks represent data manipulation instruction signal 
paths, double lines represent data flow paths, and dotted lines represent 
control signal paths. In all cases, the arrow heads represent the 
direction of... 

...1704 via a one-way data manipulation instruction signal path and a 
bi-directional control signal path. Note that in Fig. 17 single lines 
between blocks represent data manipulation instruction signal 
paths, double lines represent data flow paths, and dotted lines represent 
control signal paths. In all cases, the arrow heads represent the 
direction of...binaries are SP 1406, 1408, 1410, 1412 independent and can 
be dynamically assigned to any SP 14069 1408@ 1410® 1412 of any DMP 1452. 

Nano- thread instructions run in either of two modes: user 

execution mode or kernel 1 5 mode . All user mode 

instructions are executed by SPs 1406, 1408, 1410, 1412 without any 

assistance from any other DMP 1452 components. Kernel mode 

instructions are first intercepted by the CU 1 404 and then either 

executed in the CU 1404 or redirected to the BMI 1416 or the GPI 1402 for 

execution. Kernel mode instructions allow the DMP 1452 

to dynamically reconfigure the SPs 1406, 1408, 1410, 1412 to process 

different Nanothreads. This allows the DMP 1452 to perform thread 

switching so as to utilize all available resources at any given time. In 

addition, the kernel mode instructions allow the DMP 

1452 ...signal ports for connecting to the CU 1704 of Fig. 17. The LCUCR 

1910 communicates with the CU and delivers all control signals and 

kernel mode instructions back and forth between the CU 

and the SP 1900. 

Internal control information that is necessary to support its own 

operation is stored in the...Turning to Fig. 20, a flowchart depicting 

the interaction between the SP and the CU is presented. In Step SI, the 

SP waits for a kernel mode instruction and a control 

signal from the CU. In Step S2, the SP has received from the CU a 

kernel mode instruction defining a nano- thread 

address in the RFM and a control signal, and as a result the SP switches 

to the user execution mode from the wait mode. The SP proceeds with 

nano-thread execution. In Step S3, the SP fetches a nano-thread 

instruction . In Step S4, the instruction is decoded. Nano-thread 
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. execution may be stopped for two reasons: (a) the last instruction 
of a nano-thread is reached; or (b) a kernel mode 
instruction is reached. If the instruction is a kernel 
mode instruction , a control signal is issued to the CU in 
Step S5, the kernel mode instruction is sent to the CU, 
and the SP switches to wait mode in Step S7 before returning to Step S 1 . 
The SP remains in the wait mode until it receives new kernel 
mode instruction from the CU. If the instruction 
decoded in Step S4 is a STOP instruction indicating the last 
instruction in the thread has been reached, a control signal is 
issued to the CU in Step S8 and the SP switches to wait mode in Step S7 
before returning to Step SI. Again, the SP remains in the wait mode until 
it receives new kernel mode instruction from the CU. If 
the instruction decoded in Step S4 is a data processing 
instruction , the SP moves to Step S9 where the instruction is 
executed and then it returns to Step S3 to fetch another instruction. 

Turning to Fig. 21, the operation of the DMP 2100 is depicted. The ALU 

SPs 2106 run independent nano-threads concurrently. Running in user mode, 

nano- threads execute data processing instructions that 

operate with the RFM 2114. Each nano-thread can issue kernel 

mode instructions for BAM and RAM access, I/O, communication 

with the other DMSs and the GPC, or thread manipulation. All 

kernel mode instructions are handled by the CU 2 1 04, 

and then either executed by CU 2104 as thread manipulation or 

redirected for execution on the BMI 21 16 or the GPI 2102. The CU 2104 

handles and executes multiple kernel mode instructions 

concurrently and supports parallel operation of 43 all DMP 21 00 

components. The BMI 2116 executes BAM access instructions received from 

the CU 2104. It performs... 

Claim 

... to the means for 
instruction decoding; and 

means for control signal handling and manipulating in response to the 
instruction execution. . A method of processing data streams in a 
sub-processor comprising the steps of(a) waiting for a kernel mode 
instruction and a control signal from a control unit (b) receiving 
a kernel mode instruction pointing to an address in a 
register file 

memory containing a nano- thread for execution and a control signal; 

(c) switching to a user execution mode from a wait mode; 

(d) fetching the nano- thread instruction pointed to by the 
kernel mode instruction 

or the next unexecuted nano- thread instruction ; 

(e) decoding the fetched instruction ; 

(f) halting execution of the nano- thread if the decoded instruction 
is a kernel mode 

instruction ; 
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(g) halting execution if the decoded instruction is a last 
instruction in the nano 

thread ; and 

(h) executing the nano-thread instruction if the instruction is a data 
processing instruction and then returning to step (d) to fetch the next 



. . .nano-thread instruction. 

59 The method of claim 58 wherein the step (f) finther comprises: 
sending a control signal to the control unit; 

sending the kernel mode instruction to the control 
unit; 

switching to the wait mode from the user execution mode; and 
returning to step (a). 

60 The method of claim 58 comprising the steps of. 
switching to a user mode from a kernel mode in response to a 
kernel mode 

instruction ; 

executing nano- thread data processing instructions 

concurrently on one or more 

sub-processors; 

storing the results of nano- thread data processing 

instructions in a register file 

memory; 

switching to the kernel mode from the user mode in response 
to either reaching a I 0 kernel mode instruction while 
executing the nano- thread or reaching the end of the nano 
thread; 

executing, on a control unit, kernel mode instructions 

received from the sub 

processors as thread manipulations; 

redirecting, by the control unit, kernel mode 

instructions to a burst mode interface 

1 5 or a general purpose interface; 

executing, on the burst mode interface, kernel mode 

instructions to access burst 

mode memory and writes to the register file memory; 
executing, on the general purpose interface, kernel mode 
instructions to perform random access memory accesses, input/output 
instructions , and communication instructions received from 
the control unit, wherein execution of the instructions involve data 
transfers between the register file memory and a memory controller, an 
input/output adapter, a system bus adapter, or a data manipulation system 
controller ; and redirecting, by the general purpose interface, 
kernel mode instructions to the memory controller, the 
input/output adapter, the system bus adapter, or the data manipulation 
system controller. . A method of performing data manipulation functions 
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... the kernel context may occur frequently. According to the MIPS ISA, 
the CPU enters the kernel mode whenever an exception is detected and 
remains in kernel mode until a Restore From Exception (RFE) 
instruction is executed. Consequently, in an event driven 
application, frequent context switches can be expected regardless of the 
number of threads in user modes. 

The relative high speed of RISC processors make., .to the detailed 
description taken in conjunction with the provided figures. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a diagram of prior art pipeline instruction processing 

in a MIPS processor; Figure 2 is a schematic block diagram of the 

major functional blocks of a processor 

according to the invention; 

Figure 3 is a schematic block diagram of the major functional blocks., .a 
Con-sel code provides many significant advantages, particularly for real 
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For example, during interrupt processing, when a MIPS processor normally 

switches context from user mode to kernel mode , 

the processor according to the invention need not save and restore 

register contents. The processor according to the invention can switch to 

kernel mode in three instruction cycles and back to 

user mode in another three instruction cycles. Further, more than 

two threads are rapidly supported by loading register contents in the 

background via the AXI port with a coprocessor. The pro vision... the epc 

to the pc. The program should first move the epc to a general purpose 

register and exit the exception handler using a JR instruction . 

The present invention also adds a second status register to the interface 
block 78. The 
additional register, status 

txc, accommodates the additional coprocessors and masks for 
additional interrupt signals. This leaves the MIPS status register 
unaltered and preserves 
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... cache level which can be completed in much less time. 
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Thrashing, wherein each thread is locked in a 
repetitive cycle of switching threads without any 
instructions executing, can be prevented by the invention 
implementing a forward progress count register and method 
which allows up to a programmable maximum number of thread 
switches after which the... processing system according to the 
present invention. 

Figure 3 illustrates a block diagram of the storage 
control unit of Figure 2. 

Figure 4 illustrates a block diagram of the thread 
switch logic, the storage control unit and the instruction 
unit of Figure 2. 

Figure 5 illustrate the changes of state of a thread as 

the thread experiences different thread switch events shown 

in Figure... which 

allows up to a programmable maximum number of thread 
switches called the forward progress threshold value. 

After that maximum number of thread switches, an 
instruction must be completed before switching can occur 
again. In this way, thrashing is prevented . Forward 
progress count register 420 may actually be bits 30:3 1 in 
the thread switch control register 410 or a software 
programmable forward progress threshold... 

..610, bits 15:17 in thread state 

register 442 pertaining to thread TO are reset to state 

111. Execution of this thread is attempted in block 620 

and the state changes to 000. If an instruction 

successfully executes on thread TO, the state of thread 

TO returns to 1 1 1 and remains so. If, however, thread TO 

cannot execute an instruction, a... itself to change the priority of 

itself. 

tsop I or 1,1,1 - Switch to dormant thread 

tsop 2 or 1,1,1 - Set active thread to LOW priority 

- Switch to dormant thread 

- NOTE: Only valid in privileged mode unless TSC[ 191=1 
tsop 3 or 2,2,2 - Set active thread to MEDIUM priority 
tsop 4 or 3,3,3 - Set active thread to HIGH priority 

- NOTE: Only valid in privileged mode 
Priority switch instructions tsop 1 and tsop 2 can 

be the same instruction as embodied herein as or 1,1,1 

but they can also be separate instructions. These 

instructions interact with bits 19 and 21 of the.. .thread switch control 

register 320 is zero, the 
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priority for both threads is set to medium and the effect 
of the or x,x.,x instructions on the priority is blocked . 
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If an external interrupt request is active, and if the 
corresponding thread's priority is low, that thread's 
priority is set to medium. 
The... 
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... processing by the use of a software 
thread control manager. 

Thrashing, wherein each thread is locked in a repetitive 
cycle of switching threads without any instructions executing, 
can be prevented by the invention implementing a forward 
progress count register and method which allows up to a 
programmable maximum number of thread switches called the 
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...computer networks; and another forward progress 
threshold for shorter latency events such as cache misses. In 
these cases, the processor system may use a software 
instruction and the hardware response to an external processor 
latency event to block the incrementing of the forward progress 
counter in the thread state register, in other words, to block 
the change of state of the particular thread... or the network by the 
multithreaded processor causes 

an external processor latency event. An external latency event 
causing an excessive processor delay or a software instruction 
can block the change of state of the particular thread in the 
thread state register. Regardless of the state of the 
particular thread, the multithreaded processor still... 

...processing system according to the present 
invention. 

Figure 3 illustrates a block diagram of the storage 
control unit of Figure 2. 

Figure 4 illustrates a block diagram of the thread switch 
logic, the storage control unit and the instruction unit of 
Figure 2. 

Figure 5 illustrate the changes of state of a thread as 
the thread experiences different thread switch events shown in 
Figure... which allows up to a programmable 
maximum number of thread switches called the forward progress 
threshold value. After that maximum number of thread switches, 
an instruction must be completed before switching can occur 
again. In this way, thrashing is prevented . Forward progress 
count register 420 may actually be bits 30:3 1 in the thread 
switch control register 410 or a software programmable forward 
progress threshold.. .counter bits 15:17 in thread 
state register 442 pertaining to thread TO are reset to state 
111. Execution of this thread is attempted in block 620 and 
the state changes to 000. If an instruction successfully 
executes on thread TO, the state of thread TO returns to 111 
and remains so. If, however, thread TO cannot execute an 
instruction, a... the executing software itself to change the priority 
of itself. 

tsop I or],],] -Switch to dormant thread 

tsop 2 or 1,1,1 -Set active thread to LOW priority 

-Switch to dormant thread 

-NOTE: Only valid in privileged mode unless TSC[19]=1 
tsop 3 or 2,2,2 -Set active thread to MEDRJM priority 
tsop 4 or 3,3,3 -Set active thread to HIGH priority 
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-NOTE: Only valid in privileged mode 
Instructions tsop 1 and tsop 2 can be the same instruction 
as embodied herein as or 1 .1,1 but they can also be separate 
instructions. These instructions interact with bits 19 and 21 
of the... thread 

switch control register 320 is zero, the priority for both 
threads is set to medium and the effect of the or x„x„x 
instructions on the priority is blocked . If an external 
interrupt request is active, and if the corresponding thread's 
priority is low, that thread's priority is set to medium. 

The... 

Claim 

... or the network by the multithreaded processor 
causes an external processor latency event. 

1 1 The computer system of Claim 9 or 10 further comprising 
an instruction to block the change of state of the 
particular thread in the thread state register. 

12 The computer system of any one of Claims 9 to 1 1... 
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Detailed Description 

... level but which can be completed in much less time. 

Thrashing, wherein each thread is locked in a repetitive 

cycle of switching threads without any instructions executing, 

can be prevented by the invention implementing a progress count 

register and method which allow up to a programmable maximum 

number of thread switches af ter which the... processing system according 

to the present 

invention. 

Figure 3 illustrates a block diagram of the storage 
control unit of Figure 2. 

Figure 4 illustrates a block diagram of the thread switch 
logic, the storage control unit and the instruction unit of 
Figure 2. 

Figure 5 illustrate the changes of state of a thread as 

the thread experiences different thread switch events shown in 

Figure... which allows up to 

a programmable maximum number of thread switches called the 
forward progress threshold value. After that maximum 
number of thread switches, an instruction must be completed 
before switching can occur again. In this way, thrashing 
is prevented . Forward progress count register 420 may 
actually be bits 30:31 in the thread switch control 
register 410 or a software programmable forward progress 
threshold... 

...610, bits 15:17 in thread state 
register 442 pertaining to thread TO are reset to state 
111. Execution of this thread is attempted in block, 620 
and the state changes to 000. If an instruction 
successfully executes on thread TO, the state of thread TO 
returns to 111 and remains so. If, however, thread TO 
cannot execute an instruction, a... itself to change the priority of 
itself. 

tsop I or 1,1,1 - Switch to dormant thread 

tsop 2 or 1,1,1 - Set active thread to LOW priority 

- Switch to dormant thread 
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- NOTE: Only valid in privileged mode unless TSC[19]=1 
tsop 3 or 2,2,2 - Set active thread to MEDrUM priority 
tsop 4 or 3,3,3 - Set active thread to HIGH priority 

- NOTE: Only valid in privileged mode 

Priority switch instructions tsop 1 and tsop 2 can be 
the same instruction as embodied herein as or 1 , 1 , 1 but 
they can also be separate instructions. These instructions 
interact with bits 19 and 21 of the... the thread switch 
control register 320 is zero, the priority for both threads 
is set to medium and the effect of the or xx,x 
instructions on the priority is blocked . If an external 
interrupt request is active, and if the corresponding 
thread's priority is low, that thread's priority is set to 
medium. 

The... 
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Detailed Description 
... block diagram of the storage 
control unit of Fiqure 2. 

Figure 4, consisting of two drawing sheets designated 
Pigure 4A and Figure 4B, illustrates a block diagram of the 
thread switch logic, the storage control unit and the 
instruction unit of Figure 2. 

Figure 5 illustrate the changes of state of a thread as 

the thread experiences different thread switch events shown in... which 

allows up to a programmable 

maximum number of thread switches called the forward progress 
threshold value. After that maximum number of thread switches, 
an instruction must be completed before switching can occur 
again. In this way, thrashing is prevented . Forward progress 
count register 420 may actually be bits 30:31 in the thread 
switch control register 410 or a software programmable forward 
progress threshold... 

...610, bits 1 5 : 1 7 in thread state register 442 
pertaining to thread TO are reset to state 111. Execution of 
this thread is attempted in block 620 and the state changes to 
000. If an instruction successfully executes on thread TO, the 
state of thread TO returns to 111 and remains so. If, however, 
thread TO cannot execute an instruction, a... itself to change the 
priority 
of itself. 

tsop I or 1,1,1 -Switch to dormant thread 

tsop 2 or 1,1,1 -Set active thread to LOW priority 

-Switch to dormant thread 

-NOTE: Only valid in privileged mode unless TSC[19]=1 
tsop 3 or 2,2,2-Set active thread to MEDIUM priority 
tsop 4 or 3,,3,3-Set active thread to HIGH priority 
-NOTE: Only valid in privileged mode 
Instructions tsop I and tsop 2 can be the same instruction 
as embodied herein as or 1,1,1 but they can also be separate 
instructions. These instructions interact with bits 19 and 21 
of the... 

...thread 

switch control register 320 is zero, the priority for both 
threads is set to medium and the effect of the or x,x„x 
instructions on the priority is blocked . If an external 
interrupt request is active, and if the corresponding thread's 
priority is low, that thread's priority is set to medium. 
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Detailed Description 
... which allows up to a 

programmable maximum number of thread switches called the 

forward progress threshold value. After that maximum number 

of thread switches, an instruction must be completed before 

switching can occur again. In this way, thrashing is 

prevented . Forward progress count register 420 may actually 

be bits 30:3 1 in the thread switch control register 41 0 or a 

software programmable forward progress threshold...610, bits 15:17 in 

thread state register 442 pertaining to 

thread TO are reset to state 111. Execution of this thread is 

attempted in block 620 and the state changes to 000. If an 

instruction successfully executes on thread TO, the state of 

thread TO returns to 1 1 1 and remains so. If, however, thread 
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. TO cannot execute an instruction, a... itself to change the priority 
of itself. 

tsop I or 1,1,1 -Switch to dormant thread 

tsop 2 or 1,1,1 - Set active thread to LOW priority 

-Switch to dormant thread 

-NOTE: Only valid in privileged mode unless TSC[19]=1 
tsop 3 or 2,2,2 -Set active thread to MEDRJM priority 
tsop 4 or 3,3,3 -Set active thread to HIGH priority 
-NOTE: Only valid in privileged mode 
Instructions tsop 1 and tsop 2 can be the same instruction 
as embodied herein as or 1,1,1 but they can also be separate 
instructions. These instructions interact with bits 19 and 21 
of the... 

...thread 

switch control register 320 is zero, the priority for both 
threads is set to med urn and the effect of the or xx.,x 
instructions on the priority is blocked . If an external 
interrupt request is active, and if the corresponding thread's 
priority is low, that thread's priority is set to medium. 
The... 
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Detailed Description 

... are protected from it. This ensures that user mode services and 
applications will not write over each other's memory, or execute each 
other's instructions . Kernel mode services and 
applications are protected in a similar way. If an attempt to access 
memory outside of a prograrn@s allocated virtual space occurs, the...of 
predetermined instruction sequences, transfer control to the one 
of the plurality of predetermined instruction sequences, and 
process the one of the plurality of predetermined instruction 
sequences from the virtuai memory. 

BRIEF DESCRIPTION OF THE DRAWINGS 
Figure I is a system block diagram of an exemplary processor 
system in which the apparatus and method of the present invention 
is used. 

Figure 2 is an overall functional block... aware applications. 

Terms 
APM BIOS 

System BIOS which provides power management functions adhering to the 
APIA specification (currently at 
revisionlevel 1.2) 
Kernel Made 

A privileged processor mode in which the NT system code nins. 
A kernel mode thread has access to all Ito 
and system memory 

Power Management Kernel Driver (PM Driver) 

Provides access to the BIOS ROM, BIOS Data Area and the... 
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... hardware. In most systems, security is achieved by software, such as 
described above, which is not entirely secure because there is no 
security hardware to block access to the security software by an 
intruder. Another problem associated with software encryption algorithms 
is that some of the software encryption algorithms run slower...Note that 
any DMA transfer which is already running will never be preempted. The 
only way for a DMA transfer to be aborted in mid- stream is for a 
Host bus error to occur (e.g. a PCI abort due to a Parity error) or for 
the DSP to force an. . . 1 5 : 0] 
PCI Local Address High Register (PCB31AH) 
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If the transfer is between a PCI Host and External Memory (case 3), then 
this register holds the 10 most-significant bits of the External 
Address [25:16] and the most-significant bit will be set to T. If the 
transfer is.. .loaded as part of the crypto-context. 

Hash Paddin 

Controls are also provided to determine what is done at the end of the 
input data stream . If the operation is to be resumed at a later 
time, then the operation should be defined to exclude 'final' processing. 
In this case, no... 

...512-bit padding hash block will be added to contain the length. Any 
additional bytes required to fill out the last or extra 512-bit 
blocks are set to zero. 

As previously described, the same input data stream is internally 
buffered by both the hash and encryption sections, depending on the data 
flow of the operation selected. In the case of hash-encrypt. . . 

...negotiated for an IP connection. The pre-computed inner hash is loaded 
as the initial state of the hash algorithm before processing the input 
data stream . At the end of the 
56 

input data, normal 'final 1 processing is done, then the resulting digest 
is used as data for an additional round... sequence of steps is simplified 
if operations are performed serially with no overlap. 

Final processing for hashing operations, and padding of a short trailing 
crypto block , is initiated automatically. The end of the input data 
stream for an operation is determined by counting the number of 
bytes input and comparing to the byte length entered as part of the 
operation control... Address (READ ONLY) 
MCIA ... ::. 

Oxl8AO Not Visible Not Visible 

1514131211109876543210 

0 

1L0 

1 = DSP low Byte Enable 
LI = DSPMgh Byte Enable 
Reserved 

Reset Reason/ Instruction Register (RSTREASON) 

This 16-bit Read/Write register, as shown in the table below, allows the 

DSP processor to determine the offending address and memory... 

...section previous described on Kemal Mode Control for more information 
about the protection features of the CryptlC. 



The four memory types are as follows. 



Page 59-80 



10/643769 



PM Instruction Fetch An instruction fetch was made either to 
one of the four blocks of 

Kernel ROM (i.e. the 4 lsb's of PMOVLAY were set to OxC, D, E or 
F and a fetch was made between... O OIOIOIO70 
14-bit Memory Address which caused violation 
I 0 Memory Type which caused violation 
00 = PM Instruction Fetch 

0 1 = DM Data Fetch 
10 = PM Data Fetch 

1 1 = Protected Registers@ernel RAM data fetch 
External Memory Configuration Register (EXMEMCFG) 
Thisl6...1. 

Host wrote Command The Host Processor has just written the 
least-significant word of the Application Registers mailbox. This 
notifies the DSP of this event . 

DSP wrote Command The D SP has just written the least-significant word of 
the Application Registers mailbox. This notifies the Host of this event 
...has been Queued 

I = Master DMA Memory Transfer is Done 

I = Host Wrote Command Register 

I = Hash/Encrypt Context 0 is Done 

I = Hash/Encrypt Context I is Done 

1 = Host External Memory Conflict 

I = Host Issued Interrupt 

I = HasivEncrypt Error 

Reserved 

DSP Masked Status Register (DMSTAT) 

This 16-bit Read... the CGX Kernel context and invokes the RTI operation. 
The Status registers and Program Counter are automatically popped off the 
hardware stack at the RTI instruction . The return-frominterrupt 
operation vectors back into the CGX Kernel mode , at which 
point the Kernel Protection logic will resume and the interrupted CGX 
command processing will continue. 

104 

Storage Management 

The CGX Command Processor is. ..by the application, the CGX Kernel is 
entered by setting the PMOVLAY register to OxOOOF and then executing a 
CALL 0x2000 instruction. From the call instruction and some 
associated hardware logic and signals, the internally ROM encoded CGX 
Kernel program block is overlaid in the upper portion of the DSP's 
program space. This allows the CGX Kernel code to become active and 
transparent to the... is a specific call to address 0x2000 instruction 
with the PMOVLAY register set to OxOOOF. The AR register must contain a 
pointer to the kernel block which defines the command requested. 
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When the software call instruction occurs, the special hardware 
logic (along with some hardware signals and counters) overlay the ROM 
encoded CGX Kernel 
157 

program block over a portion of... block (described 
below). 

The data buffers, datain (plain-text) and dataout (cipher-text)," 

can share the same address. Moreover, the chip currently only supports 

cipher- block symmetrical algorithms but the encrypt interface 

provides 

for the addition of a stream interface. It does this by defining 
datain 

and dataout as unsigned, char pointers and the data length of datain to 
be in bytes. Allowing the byte count makes the interface portable for 
the later addition of stream based algorithms. However, if the 
algorithm to be used in the encrypt operation is cipher- block based 
the 

byte count must be evenly divisible by 8, any fragments are ianored by 
the operation. 

206 

Prior to invoking the encryption command the... block (described 
below). 

The data buffers, datain (plain-text) and dataout (cipher-text), 

can share the same address. Moreover, the chip currently only supports 

cipher- block symmetrical algorithms but the decrypt interface 

provides 

for the addition of a stream interface. It does this by defining 
datain 

and dataout as ...char pointers and the data length of datain to 
be in bytes. Mowing the byte count makes the interface portable for 
the later addition of stream based algorithms. However, if the 
algorithm to be used in the decrypt operation is cipher-block based 
the 

byte count must be evenly divisible by 8, any fragments are ignored by 
the operation. 

Prior to invoking the Decrypt command the application... 
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... rights of citizens and 

organizations who use this information (also 'electronic 1 or 
"digital") highway. 

2 

Electronic Content 

Today, virtually anything that can be represented by 
instructions can be formatted into electronic digital information. 
Television, cable, satellite transmissions, and on-line services 
transmitted over telephone lines, compete to distribute digital 
information and. ..other 

parties. Distribution may be by, for example, physical media 
delivery, broadcast and/or telecommunication means, and in the 
form of 'static" files and/or streams of data. VDE mav also be 
used, for example, for multi-site "real-time 1 'interaction such as 
teleconferencing, interactive games, or on-line bulletin boards... be 
protected. 

-176 

Rules and Contents' Can Be Separately Delivered 

As mentioned above, virtual distribution environment 100 

"associates" content with corresponding "rules and controls," and 

prevents the content from being used or accessed unless a set of 

corresponding 'rules and controls' is available. The distributor 

106 doesn't need to deliver.. .as a "computer prograni(s)" "embedded' 

within 

- 184 chip 504. Firmware 508 makes the hardware 506 work. 
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Hardware 506 preferably contains a processor to perform 
instructions specified by firmware 508. "Hardware" 506 also 
contains long-term and short-term memories to store information 
securely so it can't be tampered with.. .control security-relevant aspects 
of other components in the CPU/SPU. The "SPU" mode in this 
- 224 

to be fetched from secure memory 532, 534 « preventing 
execution based on "mixed" secure and non-secure instructions . 

In the "SPU" mode, mode interface switch 2658 may, in one 
example embodiment, disconnect or otherwise block external 
accesses ca= ed over bus 652 from... 

..or otherwise 

block access by microprocessor 2652 to some external memory 
and/or other functions carried over bus 652. Mode 'interface 
'tch 2658m' this example prevents other CPU 
swi 

operations/ instructions from exposing the contents of secure 
memory 532, 534. 

In the example shown in Figure 9A, the mode control of 

mode interface switch 2658 is.. .exposed by microprocessor 2652 operations 

that occur in the 

"normal" mode. This may be accomplished either by hardware 
mechanisms that protect against such exposure, software 
instructions executed in "SPU" mode that clear, reinitialize, and 
otherwise reset during such transitions, or a combination of both. 

In some example inlDlementations, 'interrupts may be... maintain security 

in such multi-package versions, it 

may be necessary to enclose all the packages and their 

interconnections in an external physical tarnper-resistant 

barrier . 

-230 

Initialization of Integrated CPU/SPU 

Instructions and/or data may need to be loaded into 

CPU/SPU 2650 before it can operate effectively as an SPU 500. 

This may occur during... precisely as does microprocessor 
2652 with respect to power-on reset and other external 
conditions. Under such conditions, only execution of the "enable 
SPU mode" instruction or otherwise requesting SPU mode under 
program control would cause "SPU" mode to be entered. 
Additionally, a mechanism could be provided to permit 
microprocessor 2652... 
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...its operation. 



In the preferred embodiment, CPU/SPU 2650 would, when 
SPU mode has not yet been established, begin operating in SPU 
mode by fetching instructions from secure non-volatile memory 
532, thereby ensuring a consistent initialization sequence and 
preventing SPU dependence on any information held outside 
CPU/SPU 2650. This approach permits secret initialization 
information (e.a., keys for validating digital signatures on 
additional... microprocessor 2652 indicatina entrv to, exit from, and 
control of SPU mode. 

Switch 2663 'in this example reco a specific 
gm 

indication (e.g., an instruction fetch access to a designated 
address in the secure memory 532) as the equivalent to the 
enable 'SPU mode" instruction . Upon recognizing such an 
indication, it may isolate the CPU/SPU 2650 from external buses 
and interfaces 2664, 2665, and 2667 such that any external... 
...such as DMA cycles, would be "held" until the switch 
235 

single access to a specific location in secure memory 532 to 
complete. 

The single instruction fetched from the designated location 
performs a control operation (a cache flush, for example), that 
can only be performed in microprocessor 2652's most privileged 
operating mode , and that has an effect visible to switch 2663. 

Switch 2663 awaits the occurrence of this event, and if it does 
not occur within the expected number of cycles, does not enter 
"SPU" mode. 

Occurrence of the control operation demonstrates that 
microprocessor 2652 is executing in its most privileged "normal" 
mode and therefore can be trusted to execute successfully the 
#1 enter'SPU mode" sequence of instructions stored in secure 
memory 532. If microprocessor 2652 were not executing in its 
most privileged mode , there would be no assurance that those 
instructions would execute successfully. Because switch 2663 
isolates microprocessor 2652 from external signals (e.g., 
interrupts) until "SPU" mode is successfully initialized, the entry 
instructions can be guaranteed to complete successfully. 

236 

Following the initial instruction , switch 2663 can enter 
of partial SPU mode," in which a restricted area of ROM 532 and 
RAM 534 may be accessible. Subsequent instructions in secure 
memory 532 may then be executed by microprocessor 2652 to 
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place it into a known state such that it can perform SPU 
functions ~ saving any previous state in the restricted area of 
RAM 534 that is accessible. After the known state is established, 
an instruction may be executed to deliver a further indication 
(e.g., a reference to another designated memory location) to 
switch 2663, which would enter "SPU" mode... 

...mode. Once in "SPU" mode, switch 

2663 permits access to all of ROM 532, RA"M 534, and other 
devices in SPU chip 2660. 

The instructions executed during "partial SPU" mode must 
be carefully selected to ensure that no similar combination of 
instructions and processor state could result 'in a control 
transfer 

out of the protected SPU code in ROM 532 or RAMT 534. For 
example, internal debugging... 

...disabled or 

reinitialized to ensure that previously created MMU data 
-237 

structures would not permit SPU memory accesses to be 
compromised. The requirement that the instructions for "partial 
SPU mode" run in the microprocessor 2652 ? s most privileged 
mode is necessary to ensure that all its processor control 
functions can be effectively disabled. 

The switch 2663 provides additional protection against 
tampering by ensuring that control functions) may 
be prevented by switch 2663 unless they had been explicitly 
enabled by instructions executed after "SPU" mode is entered. 

.238 

To leave SPU mode and return to normal operation, the 
instructions executing in "SPU" mode may provide a... 
...such an embodiment, switch 2663 may not implement partial 
SPU mode, but may instead enter SPU mode directly and ensure 
that the address from which instructions would be fetched by 
microprocessor 2652 (specific to microprocessor 2652 f s 
architecture) results in accesses to appropriate locations in the 
SPU memory 532. This could... users. 

Computer systems are usually made up of several different 
hardware components. These hardware components include, for 
example. 

a central processing unit (CPU) for executing instructions ; 
- 246 

an array of main memory cells (e.g., ? RAIM" or 'ROM") for 
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, storing instructions for execution and data acted upon or 
parameterizing those instructions ; and 
one or more secondary storage devices (e.g., hard disk 
drive, floppy disk drive, CD-ROM drive, tape reader, card 
reader, or "flash' memory... discover details of processes; using 
self-generating" code (based on the output of a co-sine 
transform, for example) such that detailed and/or complete 
instruction sequences are not stored explicitly on storage devices 
and/or in active memory but rather are generated as needed; 
using code that "shuffles 1 memory locations.. .isuq 
f uLioj.iad oz 2uuudaid -io/pu-e 12umuopad u-i asnioj sdriqsuoT -elei 
DPZST/L6Sfl/l Jd 6OZ60/86 OM 
information that identifies associated basic 
instructions and said intrinsic data for access, 
correlation and/or validation purposes; 
0 required and/or optional parameters for use with 
basic instructions and said intrinsic data; 
0 information defining relationships to other methods; 
0 data elements that may comprise data values, fields 
of information, and/or the... 
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... streams enables 

the effectively concurrent parallel transfers of digital 
data streams to and through the channel controller. 
This substantially alleviates the need for host 
application context switches and most host user to 
interrupt kernel mode switches while providing real-time 
transport support for multiple concurrent data streams 

A yet further advantage of the present invention is 
that it provides dynamic support of variable rate real 
time signal processing by ensuring effectively 
continuous.. .concurrent data and control signal transfers 
between the PCI bus 20 and system 22. These concurrent 
transfers are preferably managed as independent pairs of 
data streams and control threads of execution where each 
stream is composed of discrete block transfers, 
consisting of one or more bytes of data, through the I/O 
channel controller 26. Data blocks of concurrently 
supported streams are interleaved among one another to 
effectively establish concurrent stream data transport 

In accordance with the present invention, the 

interleave of data blocks is determined by ...core 26 reflects the 

respective demand transport rates 

of the individual data streams as determined from the 

nature of the stream data transported. Consequently, 

where stream data represents a constant data frequency 

audio signal stream , the interleave of data blocks in 

the audio data stream may vary, though generally 

maintaining a fixed through-put rate reflective of the 

data stream transfer rate independently demanded by an 

audio stream coder/decoder... transfer 

The host processor establishes a control thread in 

the main memory space 54 by constructing one or more 

linked bus transfer unit (BTU) control blocks to 

associate the stream data as provided in the main memory 

space 54 with the control thread. The host processor 

12, in execution of the I/O channel controller... stream 72 may be 

subsequently 

transferred again by the digital signal processor to any 
directly connected serial bus peripheral 40, 42 
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Bus transfer unit control blocks that define a 

control thread for transferring the DSP RAM stored data 

stream 69, 80 can ...36 

to operate by definition independent of one another with 
respect to the use of the DSP RAM 34 

The DSP bus transfer unit control blocks are 
defined to control the transfer of one or more data 
streams through and under the control of the V ...1 16 to the 
memory 122 or external 

hardware 124 of the controller peripheral layer 118. 

The device driver 1 12 establishes one or more BTU 

control blocks within the memory 106 defining a control 

thread necessary to implement the data stream transfer. 

The initial BTU of the control thread is programmed by 

the device driver 112 into the I/O channel controller 

and a signal is.. .channel controller core 

116. Preferably, the message is acknowledged by the DSP 

120, the requested function is initialized, a portion of 

the externally provided data stream is received and 

processed by the DSP 120 and stored in a known address 

block within the memory 122. An enable signal 136 ...DSP side BTU 

is loaded into the I/O 

channel controller core 116 and left disabled. 

Meanwhile, a next portion of the externally provided 

data stream is received and processed into another known 

address block within the memory 122 that is referenced 

by the next DSP side BTU. Again, once this address 

block has been filled with processed ...136. In general, the two address 

blocks in 

the memory 122 are successively alternatingly used in 
successive alternation for the processing of a single 
data stream 

When the address block specified by the current 
host side BTU in the memory 106 is filled by the 
processed incoming data stream, the BTU is complete and 
the... 
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Detailed Description 
... streams enables 

the effectively concurrent parallel transfers of digital 
data streams to and through the channel controller. 

This substantially alleviates the need for host 
application context switches and most host user to 
interrupt kernel mode switches while providing real-time 
transport support for multiple concurrent data streams . 

is A yet further advantage of the present invention is 

that it provides dynamic support of variable rate real 

time signal processing by ensuring effectively... concurrent data and 

control signal transfers 

between the PCI bus 20 and system 22. These concurrent 
transfers are preferably managed as independent pairs of . 
data streams and control threads of execution where each 
is stream is composed of discrete block transfers, 
consisting of one or more bytes of data, through the 1/0 
channel controller 26. Data blocks of concurrently 
supported streams are interleaved among one another to 
effectively establish concurrent stream data transport. 



In accordance with the present invention, the 
interleave of data blocks is determined by the 
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. controller sub-system 22 and, in general, independent of 
the execution of the host processor 12, The interleave 
of data blocks... 

...core 26 reflects the respective demand transport rates 
of the individual data streams as determined from the 
nature of the stream data transported. Consequently, 
where stream data represents a constant data frequency 
audio signal stream , the interleave of data blocks in 
the audio data stream may vary, though generally 
maintaining a fixed through-put rate reflective of the 
data stream transfer rate independently demanded by an 
audio stream coder/decoder... transfer. 

The host processor establishes a control thread in 

the main memory space 54 by constructing one or more 

linked bus transfer unit (BTU) control blocks to 

associate the stream data as provided in the main memory 

space 54 with the control thread. The host processor 

12, in execution of the I/O channel controller... stream 72 may be 

subsequently 

transferred again by the digital signal processor to any 
directly connected serial bus peripheral 40, 42. 

Bus transfer unit control blocks that define a 
control thread for transferring the DSP RAM stored data 
stream 69, 80 can be provided in either system main or 
DSP memory 14, 34. Preferably, the DSP control thread 
bus transfer unit control blocks are... 

...36 

to operate by definition independent of one another with 
respect to the use of the DSP RAM 34. 

The DSP bus transfer unit control blocks are 

defined to control the transfer of one or more data 

streams through and under the control of the I/O channel 

controller core 62 to any of the external hardware 

interfaces 74 of the serial and...l 16 to the memory 122 or external 

hardware 124 of the controller peripheral layer 118. 

The device driver 1 12 establishes one or more BTU 
control blocks within the memory 106 defining a control 
thread necessary to implement the data stream transfer. 

The initial BTU of the control thread is programmed by 
the device driver 112 into the I/O channel controller 
116 and a signal... channel controller core 
116. Preferably, the message is acknowledged by the DSP 
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120, the requested function is initialized, a portion of 
the externally provided data stream is received and 
processed by the DSP 120 and stored in a known address 
block within the memory 122. An enable signal 136 is 
then provided to the I/O channel controller core 1 16 to 
enable the initial DSP side... 

...DSP side BTU is loaded into the I/O 
channel controller core 116 and left disabled. 

Meanwhile, a next portion of the externally provided 

data stream is received and processed into another known 

address block within the memory 122 that is referenced 

by the next DSP side BTU. Again, once this address 

block has been filled ...136. in general, the two address blocks in 

the memory 122 are successively altematingly used in 

successive alternation for the processing of a single 

data stream. 



When the address block specified by the current 
host side BTU in the memory 106 is filled by the 
processed incoming data stream, the BTU is complete and 
the... 
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Detailed Description 
... for repeatedly 

evaluating the status information and providing for the 
prioritied selection of a first data transfer device and 
a predetermined one of the memory blocks . 

A computer system utilizing the multiple parallel 
digital data stream channel controller of the present 
invention can thus support the concurrent real-time 
transfer of a plurality of I/O data streams to and from... 

...streams enables 

the effectively concurrent parallel transfers of digital 
data streams to and through the channel controller. 
This substantially alleviates the need for host 
application context switches and most host user to 
interrupt kernel mode switches while providing real-time 
transport support for multiple concurrent data streams , 
A yet further advantage of the present invention is 
that it provides dynamic support of variable rate real 
time signal processing by ensuring effectively 
continuous... concurrent data and control signal transfers 
between the PCI bus 20 and system 22. These concurrent 
transfers are preferably managed as independent pairs of 
data streams and control threads of execution where each 
stream is composed of discrete block transfers, 
consisting of one or more bytes of data, through the I/O 
channel controller 26. Data blocks of concurrently 
supported streams are interleaved among one another to 
effectively establish concurrent stream data transport. 

In accordance with the present invention, the 
interleave of data blocks is determined by the 
controller sub-system 22 and, in general, independent of 
the execution of the host processor 12. The interleave 
of data blocks... 

...core 26 reflects the respective demand transport rates 
of the individual data streams as determined from the 
nature of the stream data transported. Consequently, 
where stream data represents a constant data frequency 
audio signal stream , the interleave of data blocks in 
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the audio data stream may vary, though generally 
maintaining a fixed through-put rate reflective of the 
data stream transfer rate independently demanded by an 
audio stream coder/decoder... transfer. 

The host processor establishes a control thread in 
the main memory space 54 by constructing one or more 
linked bus transfer unit (BTU) control blocks to 
associate the stream data as provided in the main memory 
space 54 with ...stream 72 may be subsequently 
transferred again by the digital signal processor to any 
directly connected serial bus peripheral 40, 42, 
Bus transfer unit control blocks that define a 
control thread for transferring the DSP RAM stored data 
stream 69, 80 can be provided in either system main or 
DSP memory 14, 34. Preferably, the DSP control thread 
bus transfer unit control blocks are... 
...36 

to operate by definition independent of one another with 
respect to the use of the DSP RAM 34. 

The DSP bus transfer unit control blocks are 

defined to control the transfer of one or more data 

streams through and under the control of the I/O channel 

controller core 62 to any of the external hardware 

interfaces 74 of the serial and...l 16 to the memory 122 or external 

hardware 124 of the controller peripheral layer 118. 

The device driver 1 12 establishes one or more BTU 
control blocks within the memory 106 defining a control 
thread necessary to implement the data stream transfer. 

The initial BTU of the control thread is programmed by 

the device driver 1 12 into the I/O channel controller 

116 and a signal.. .channel controller core 

116, Preferably, the message, is acknowledged by the DSP 

120, the requested function is initialized, a portion of 

the externally provided data stream is received and 

processed by the DSP 120 and stored in a known address 

block within the memory 122, An enable signal 136 is 

then provided to the I/O channel controller core 1 16 to 

enable the initial DSP side stream is received and processed into 

another known 

address block within the memory 122 that is referenced 
by the next DSP side BTU. Again, once this address 
block has been filled with processed data, the... 



...136, In general, the two address blocks in 
the memory 122 are successively alternatingly used in 
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successive alternation for the processing of a single 
data stream . 

-29 

When the address block specified by the current 
host side BTU in the memory 106 is filled by the 
processed incoming data stream, the BTU is complete and 
the... 
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Detailed Description 
... streams enables 

the effectively concurrent parallel transfers of digital 

data streams to and through the channel controller. 
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This substantially alleviates the need for host 
application context switches and most host user to 
interrupt kernel mode switches while providing real-time 
transport support for multiple concurrent data streams. 

A yet further advantage of the present invention is 
that it provides dynamic support of variable rate real 
time signal processing by ensuring effectively 
continuous... concurrent data and control signal transfers 
between the PCI bus 20 and system 22. These concurrent 
transfers are preferably managed as independent pairs of 
data streams and control threads of execution where each 
stream is composed of discrete block transfers, 
consisting of one or more bytes of data, through the I/O 
channel controller 26. Data blocks of concurrently 
supported streams are interleaved among one another to 
effectively establish concurrent stream data transport. 

In accordance with the present invention, the 
interleave of data blocks is determined by the 
- 14 

controller sub-system 22 and, in general, independent of 
the execution of the host processor 12. The interleave 
of data... 

...26 reflects the respective demand transport rates 
of the individual data streams as determined from the 
nature of the stream data transported. Consequently, 
where stream data represents a constant data frequency 
audio signal stream , the interleave of data blocks in 
the audio data stream may vary, though generally 
maintaining a fixed through-put rate reflective of the 
data stream transfer rate independently demanded by an 
audio stream coder/decoder... transfer. 

The host processor establishes a control thread in 
the main memory space 54 by constructing one or more 
linked bus transfer unit (BTU) control blocks to 
associate the stream data as provided in the main memory 
space 54 with the control thread. The host processor 
12, in execution of the I/O channel controller... 

...stream 72 may be subsequently 
transferred again by the digital signal processor to any 
directly connected serial bus peripheral 40, 42. 

Bus transfer unit control blocks that define a 

control thread for transferring the DSP RAM stored data 

stream 69, 80 can be provided in either system main or 
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. DSP memory 14, 34. Preferably, the DSP control thread 
bus transfer unit control blocks are... 

...36 

to operate by definition independent of one another with 
respect to the use of the DSP RAM 34. 

The DSP bus transfer unit control blocks are 
defined to control the transfer of one or more data 
streams through and under the control of the I/O channel 
-22 

controller core 62 to any of the external hardware 

interfaces 74 of the serial... 1 16 to the memory 122 or external 

hardware 124 of the controller peripheral layer 118. 

The device driver 112 establishes one or more BTU 
control blocks within the memory 106 defining a control 
3b thread necessary to implement the data stream transfer. 

The initial BTU of the control thread is programmed by 

the device driver 1 12 into the I/O channel controller 

116 and a signal... channel controller core 

116. Preferably, the message is acknowledged by the DSP 

120, the requested function is initialized, a portion of 

the externally provided data stream is received and 

processed by the DSP 120 and stored in a known address 

block within the memory 122. An enable signal 136 is 

then provided to the 1/ ...DSP side BTU is loaded into the I/O 

channel controller core 116 and left disabled. 

Meanwhile, a next portion of the externally provided 
data stream is received and processed into another known 
address block within the memory 122 that is referenced 
by the next DSP side BTU. Again, once this address 
block has been filled with processed data, the... 

...136. In general, the two address blocks in 
the memory 122 are successively alternatingly used in 
successive alternation for the processing of a single 
data stream . 

When the address block specified by the current 
host side BTU in the memory 106 is filled by the 
processed incoming data stream, the BTU is complete and 
the... 

Claim 

... second interfaces; 
c) a buffer memory coupled between said first 
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buffer and each of said second buffers, said buffer 
memory including a plurality of memory blocks permitting 
storage of third variable segment portions of the 
multiple data streams ; and 
d) a stream controller coupled to said first 
interface, to each of said second interfaces, and to said 
buffer memory, said controller being responsive to the 
respective data lengths of said second variable segment 
portions, said stream controller selectively enabling 
transfers of data between a predetermined memory block of 
said buffer memory and a predetermined second buffer. 

6 The data transfer controller of claim 5 wherein 
said stream controller is responsive to the respective 
data lengths of said third variable segment portions, 
said stream controller selectively enabling transfers of 
data between said first buffer and said plurality of 
memory blocks so as to support the availability of data 
in said predetermined memory block subject to the 
transfers of data with respect to said predetermined 
2.5 second buffer. 

7 The data transfer controller of Claim 6 wherein 
said stream controller supports the availability of data 
in each of said memory blocks subject to the combined 
transfers of data with respect to said second buffers. 
B. The data transfer controller of Claim 7 wherein 
said stream controller... 

...with the respective data 

transfers of said second variable segment portions 
through said second buffers. 

9 The data transfer controller of claim 8 wherein 
said stream controller operates sets of one or more of 
said memory blocks as independent circular FIFO buffers. 

10 The data transfer controller of Claim 9 wherein 
said first and second buffers are FIFO buffers. 

11 A computer... 
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... Prevents moving a page of data out of memory 
by virtual memory processing. 

System Call - A function call to the operating system. 

This causes a context switch to the operating system, 
System Level - The operating system. System level code 
normally uses the privileged mode of the CPU operation which 
allows the full instruction set of the processor to be used. 

a User-level - Applications. User-level code normally 

does not use the privileged mode of the CPU operation, so a 

restricted set of instructions is allowed. 

DMA is a common technique for routing data directly between 
memory and an I/O device without requiring intervention by the 
CPU to...atomicity of DMA 

transfer initiations, to create virtual memory mappings, and to 
maintain memory proxy mappings during virtual memory paging. 
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The first invariant, II must hold regardless of source and 
destination devices. 
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1 1 : If a LOAD instruction initiates a UDMA data transfer, 
then the destination address and the byte count must have been 
STOPvEd by the same process, 
The next three invariants... 
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... the process for execution to 
the master processor. Execution of the transferred 
process resumes on the master processor substantially at a 
point at which execution stopped on the slave processor., 
Illustrativelyt execution of an interrupted instruction of 
the process is either restarted or merely completed on 
master processor. In the former cases the attempt to 
enter privileged mode is repeated on...the invention. 



Typicallyr the 3B15 system uses two modesr or 
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. levelsr of operation of the WE 32100 microprocessor: a 
user mode for executing user program instructionsr and a 
privileged mode for executing functions such as operating 
system instructions that have the potential for corrupting 
shared system resources such as hardware resources. There 
are two mechanisms for entering privileged mode from user 
mode: a process switch mechanism and a system call 
mechanism. The system call mechanism is also known by 
names such as a supervisory call and an ...a means of controlled entry 
into a 

35 function by installing on a processor a new processor 
status word (PSW) and program counter (PC). The system 
call mechanism is used by explicit operating system GATE 
calls to transfer control from user to privileged mode , 
and is; also used to handle 'normal' system exceptions. 

('Quick' interrupts, which would also be handled by this 
mechanism, are not used by the UNIX... 

...exception is an error condition--a fault 
or a trap-other than an interrupt. Normal exceptions 
constitute most system exceptions. The normal exception 
handlers are privileged - mode functions. 

The system call mechanism uses the execution 
stack of the present process; that is, a normal exception 
handler or a function called via a GATE instruction uses 
for... 

...privileged state will not become 
corrupted. Therefore, prior to making a legal entry of a 
privileged function, i.e., before executing a transfer to 
the privileged mode upon the occurrence of a normal 
exception or a GATE requestr the system call mechanism 
checks the present execution stack pointer value against 
the execution stack boundary values that are stored in the 
process control block of the presently... processor 12, or execution of 
the partially-executed 

instruction is merely completed on master processor 12. 

Unless the condition that caused the attempt to enter 
privileged mode was a transient faultr execution of that 
instruction on master processor 12 results in the 
invocation of the operating system service. That service 
is then provided in a conventional, uniprocessor, manner 
on master... the physical address of the slave's 
initial process control block). Also at step 502, a 
process switch to the slave's initial process control 
block is performed by the conventional PETPS instruction . 
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As part of this process switch, the slave's process 
control block pointer register is set to the physical 
address of the intermediate portion of the slave's initial 
process control block, the initial program counter is... 

...step 503, which resets the 
process control block pointer register to the virtual 
address of the intermediate portion of the slave's initial 
process control blockf and executes the conventional 
ENBVJMP instruction which enables virtual addressing for 
slave processor 25 and transfers to the virtual address 
contained in register rO. 

The slave virtual-mode startup routine is... This is implemented by 
calling a slave delete routine (see 
FIG* 14). at step 1002a Illustratively in this exampler 
because the address of the faulting instruction is already 
saved-in the process control block as result of stack 
exception handlingr the faulting instruction is reexecuted 
once the process restarts execution on master 
processor 12. This feature is fairly typical of systems 

with demand-paged memory management. This results.. .routine (see FIG. 13) 

must be invoked to avoid restart problems. This is 

illustratively accomplished, at step 1001, by removing the 

address of the faulting instruction from the process 

control block and substituting therefor the starting 

address of the instruction restart routiner and storing 

the removed faulting instruction's address in a variable. 

When the instruction restart routine completesr it 
restores the faulting instruction 's address in the process 
control block . 

Although far less usual than normal exceptions, 
another exception condition is possible when executing 
user-mode processes: system error. This category includes 
things like alignment... 

...as the standard, master, stack exception handler 
deals with MAU restart problems. The instruction restart 
routine then restores the address of the user process' 
faulted instruction to the process control block of the 
faulted process to cause execution of the faulted 
instruction , at step 1203p and returns at step 1204 to 
execution of that instruction. 
FIG* 14 flowcharts the slave delete routine. 
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This routine is invoked on.. 
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Abstract 

Recent developments have greatly reduced network 
latencies in multiprocessor networks. Thus, software 
overhead is becoming the primary cost of multiproces- 
sor communication. This paper proposes data stream- 
ing — a technique which places explicit send and re- 
ceive instructions in the user code — as a means to 
cut software overhead to a minimum. Data streaming 
has the added benefit that it can tighten the coupling 
between processors by reducing the message size to that 
of a single data item. This paper presents experimen- 
tal results that indicate data streaming can cut soft- 
ware overhead to less than one instruction per byte of 
data transmitted. 

1 Introduction 

Advances in multiprocessor interconnection net- 
works have greatly reduced the network latency in 
massively parallel computer architectures^, 13, 12]. 
Software overhead has emerged as the primary cost 
of multiprocessor communication. Traditional mes- 
sage passing uses interrupts and operating system 
calls to handle communication. This approach is very 
costly in software overhead due to the required context 
switches, type matching operations, and buffering. 

An alternative approach is to allow the messages 
themselves to carry pointers[14] (continuations [11]) to 
user specified message handling routines. These mes- 
sages are called active messages[\4]. Active messages 
have the advantage that every message can be serviced 
immediately and no operating system buffering is nec- 
essary. In addition, costly type matching searches are 
eliminated. However, interrupting the current thread 
of control requires a context switch and, in some cases, 
a trap to kernel mode[14]. Such context switches can 
be costly on modern microprocessors. 

The fundamental problem is that message traffic 
is unpredictable (or assumed to be) in such program- 
ming models. The overhead present in software and 
hardware is the cost of run-time message disambigua- 
tion. If we can statically determine message arrival 
orders and schedule message traffic at compile-time, 
we can eliminate these costs. In those cases where 
messages cannot be disambiguated at compile time, 



we need architectural constructs to allow them to co- 
exist in the network without introducing ambiguities. 
This paper proposes data streaming, a set of architec- 
tural and programming techniques that imbed explicit 
send and receive instructions directly in the user's pro- 
gram. Each send instruction is statically matched to 
a receive instruction or a set of "equivalent" receive 
instructions on another node. To allow multiple com- 
munication patterns to overlap within the network, 
distinct virtual networks are supported in hardware 
that can be mapped to data streams at compile-time. 

This paper presents the theoretical foundations and 
some experimental support for data streaming. Ex- 
perimental results indicate that data streaming can 
significantly reduce the costs of communication. Soft- 
ware overhead using data streaming can be reduced 
to less than one instruction per byte of data trans- 
mitted. In addition, for two of the programs exam- 
ined, software overhead and unhidden network latency 
were significantly less than the penalty suffered by lo- 
cal cache misses. In all programs examined, the total 
cost of communication (software overhead and unhid- 
den network latency) was less than 8 cycles per byte 
transmitted. 

2 Stream-based communication 

Traditional message passing techniques had large 
startup and overhead-per-word costs. It was neces- 
sary to amortize the startup costs by keeping mes- 
sage sizes large and minimizing the communication 
between non-local threads. While minimizing com- 
munication is still a good idea, sending large blocks of 
data limits parallelism that can be exploited within a 
block. For example, during a vector operation, mes- 
sage passing programs typically receive the non-local 
vector in full, apply the operation, and send the re- 
sulting vector to its next destination. A finer grained 
approach would begin vector operations as the individ- 
ual vector components arrived. Thus, the operation on 
the vector can be overlapped with the transmission of 
the vector from its source. What's more, the resulting 
vector can begin its journey to its next destination as 
the results are computed, creating a pipeline of pro- 
cessors all working on different elements of the same 
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vector. The values are streamed into and out of each 
processing element. There is minimal need for buffer- 
ing and, in some instances, values are never placed in 
local memory at ail, passing directly from network to 
register to network. 

2.1 Architectural support 

Figure 1 shows a diagram of a data streaming mul- 
tiprocessor. In the diagram, the processor (P) commu- 
nicates with other processors via a network interface 
unit (NIU) and a number of routers. The NIU has 
multiple virtual channels to and from the local router. 
The NIU is nothing more than a virtual channel multi- 
plexer that controls which virtual channels are visible 
to the CPU for send and receive instructions. Via 
NIU coprocessor instructions, the CPU can select in- 
put from any of the incoming virtual channels. The 
channels are independent streams — if one becomes 
blocked other channels can continue to transmit and 
receive data. In this way, multiple communication pat- 
terns can be overlapped. Stream data is received a 
word at a time as the data is needed. Every circuit 
that the processor creates has an associated stream 




Figure 1: A node in a multiprocessor network 

descriptor which specifies the virtual network the cir- 
cuit uses. If the required channel at the destination 
node is occupied, the routing header will block in the 
network. 

The following instruction set enhancements are pro- 
posed to support data streaming: 

set_send_sd Srs — Set the stream descriptor of 
the output stream to the contents of register $rs. 
If this descriptor is active, the next send will out- 
put a data word to the active stream. Otherwise, 
the next send will specify the destination of the 
new stream. 

send Srs — Send the contents of register $rs to 
the current output stream. If the required output 
channel's buffer is full, the instruction stalls the 
CPU. 

sende Srs — Send the contents of register $rs 
to the current output stream and end the circuit 
(i.e. generate the tail flit). Stall CPU on buffer 
full. 



set_recv_sd Srs — Set the stream descriptor of 
the input stream to the contents of the $rs regis- 
ter. The next receive instruction will attempt to 
read data from this virtual channel. 

recv Srs — Receive a word of data into $rs from 
the current input stream. If no data is present, 
stall until data for that stream arrives. 
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Figure 2: A disjoint stream router 

The network routers will implement a disjoint 
stream network where each virtual channel belongs 
to a virtual network called a stream net No routing 
is allowed between these stream nets. Thus, messages 
with different stream descriptors cannot interfere with 
each other in the network. The router for a disjoint 
stream network is much simpler than a more general 
router for a network with v general purpose virtual 
lanes on each link. In an n dimensional mesh, a dis- 
joint stream network would require v n xn crossbars 
as opposed toavnx vn crossbar which is about v 
times larger. 

3 Deadlock-freedom 

Data streaming is not a trivial extension of cur- 
rent parallel programming techniques. This section 
will briefly describe conditions that are sufficient to 
guarantee deadlock-free, correct operation. A com- 
plete development can be found in [7]. 

Process equivalence: Two circuits are process equiv- 
alent if they are processed by a processor in the net- 
work in exactly the same way. That is, the processor 
has no need to distinguish between them, and thus, 
their arrival order is unimportant. For example, in a 
global sum reduction each circuit is process equivalent. 
All results are added together regardless of arrival or- 
der. Any circuits that are process equivalent destined 
for the same node can be mapped to the same stream 
net since there is no need to distinguish between them. 

Communication patterns: A communication pat- 
tern is a group of circuits that are all mapped to the 
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same stream net and may co-exist in the network at 
the same time. 

Send-receive disjoint: A section of processor code 
is send-receive disjoint if the sends and receives for 
outgoing and incoming circuits are not interleaved. 

Guaranteed routing: A communication pattern is 
guaranteed routing if every circuit in the pattern can 
Be routed before the first receive for that pattern is 
issued. For example, any pattern in which there is 
no contention for links (i.e. the routes are disjoint) is 
guaranteed routing. 

Buffer conservation: A communication pattern is 
strictly buffer conservative if each processor in the pat- 
tern cannot stall on a send before it receives every data 
item destined for it in the pattern. Thus, a processor 
must issue every receive instruction for a pattern be- 
fore it may issue a send that could overflow the buffer 
space for a circuit. In general, strict buffer conserva- 
tion ensures that processors sink enough data to guar- 
antee forward progress of all circuits in the communi- 
cation pattern. 

A communication pattern is leniently buffer con- 
servative if each processor in the pattern cannot stall 
on a send unless the sum of the number of receives 
and the buffer space are greater than the number of 
sends it has issued. When a communication pattern 
is guaranteed routing, lenient buffer conservation is 
sufficient to ensure forward progress of the communi- 
cation pattern because any processor stalls are only 
due to buffer constraints. Thus, if the overall buffer 
pool of the communication pattern does not overflow, 
at least one processor will be able to advance at any 
given time. 

Correct, deadlock- free communication can be en- 
sured by mapping circuits that are not process equiv- 
alent to different communication patterns. Commu- 
nication patterns may overlap in time provided they 
are in different stream nets and processor communi- 
cation activity in different patterns is strictly ordered. 
In general, processor code should be send-receive dis- 
joint and strictly buffer conservative for a communi- 
cation pattern. If the pattern is a guaranteed routing 
pattern, the code may be only leniently buffer conser- 
vative and is not required to be send-receive disjoint. 



4 Simulation model 

I performed simulation studies using a MIPS R2000 
based multiprocessor simulator. The base machine 
was a 64-node (8x8), 2-dimensional mesh with MIPS 
R2000 processors at each node. Each processor had 
a 32kB, 4-word- per-line, direct-mapped, write-back, 
unified cache. 

The network was fully simulated. Network links 
were full-duplex with 4-virtual channels per unidi- 
rectional physical link. The routing algorithm was 
deterministic e-cube routing[4, 9]. Thus, a total of 
four disjoint stream nets were available to the par- 
allel programs. The network flow-control mechanism 
was wormhole routing using 16-bit flow-control digits 
(flit*)- 



Table 1: Benchmark results 



Benchmark 


Cycles 


Memory stalls 


Network stalls 


MM par 


51,462 


1,028.0 


17.2 


MM seq 


2,862,227 


133,190.0 


0.0 


mw 


7,035 


2,272.2 


283.5 


JWseq 


53,449 


3,660.0 


0.0 


Gauss par 


893,673 


8,669.8 


492,043.4 


Gauss seq 


14,704,485 


1,966,150.0 


0.0 


LC/par 


459,554 


4,130.9 


8,396.6 


itfseq 


15,289,994 


3,881,600.0 


0.0 



Table 2: Software overhead 



Benchmark 


lotal oytes 


Overhead instr 


Instr/Byte 


MM 


1,032,448 


515,359 


0.5O0 


FFT 


12,288 


10,445 


0.850 


Gauss 


4,386,616 


2,389,376 


0.545 


LU 


2,747,136 


1,385,799 


0.504 



5 Benchmarks 

I implemented four algorithms using data stream- 
ing to demonstrate its effectiveness. The benchmarks 
used were matrix multiply {MM)> fast Fourier trans- 
form (FFT), Gaussian elimination (Gauss), and LU 
decomposition. (LU). Each benchmark was optimized 
for data streaming. A more complete description of 
the benchmarks and their communication patterns is 
available in [7]. 

6 Results 

Tables 1 and 2 tables summarize the results of the 
simulation studies on the benchmarks. 

Overhead instructions were the total of all instruc- 
tions used for network communication and for destina- 
tion address calculation. We see that for each bench- 
mark, the software overhead per byte of data transmit- 
ted was less than 0.9 instructions. This is a result of 
placing the network instructions directly into the pro- 
gram code and eliminating the buffering and context 
switch associated with library based communication 
methods. 

The benchmarks display a variety of behaviors. 
First, MM is clearly the best benchmark for paral- 
lelization and for data streaming. The flow of data 
was easily arranged so that network latency was over- 
lapped with useful instructions. As a result, on the 
average, PEs spent only 17.22 cycles idling on net- 
work instructions. The FFT algorithm is also very 
amenable to parallelization, albeit less so that MM. 
The problem distribution is not well balanced and, 
thus, PEs stall on network instructions an average of 
283.5 cycles out of 7,035. Gauss proved to be the most 
difficult benchmark to parallelize. The algorithm is 
strongly sequential in nature and is inherently unbal- 
anced. The available parallelism declines linearly dur- 
ing the forward elimination phase as rows are elim- 
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mated. As a result, the average network stall cycle 
count was over half of the total execution time. How- 
ever, data streaming does reduce the software over- 
head associated with the algorithm. Interestingly, L U 
proved well suited to the techniques of data stream- 
ing. Despite the fact that Gauss and LU are similar 
in the types of operations performed, the well struc- 
tured nature of L U greatly enhanced performance, the 
simplicity of the code and of the communication pat- 
terns. The number of network stall cycles for LU was 
8,396.6 cycles per processor which was approximately 
twice the cache miss penalty. 

It is interesting to note that the network stall cycles 
for Gauss and L U were not primarily due to communi- 
cation latency. The stall cycles were primarily a result 
of load imbalance. The average circuit setup time for 
Gauss was 7.96 cycles and 8.56 cycles for LU. 

7 Conclusions and future work 

This paper has presented data streaming, a tech- 
nique for very efficient communication in parallel ar- 
chitectures. Data streaming actieives verv high per- 
formance communication by minimizing the software 
overhead required to transmit and receive data. Ex- 
perimental results indicate that software overhead is 
less than one instruction per byte of data transmit- 
ted. This is significantly less than that achieved by 
other techniques[14, 11]. The cost of communication 
in terms of software overhead and in terms of idling 
due to network latency was lower for both MM and 
FFT than the memory stalls due to cache misses — 
indicating a very tight coupling between processors. 
The efficiency of data-streaming allowed a fast im- 
plementation of Gauss despite the strongly sequential 
nature of the algorithm. Another matrix operation 
benchmark LU proved very ammenable to the tech- 
niques of data streaming. The low software overhead 
of data streaming allows parallel programs to operate 
on finer-grained problems. 

The architectural support given in this paper re- 
quires no microprocessor modification. However, 
modifying the microprocessor itself can also enhance 
the benefits of data streaming. The network inter- 
face could be implemented as a functional unit of a 
dynamic superscalar processor. In this case, the pro- 
cessor need not stall on blocked sends and unsatisfied 
receives. Instructions could be issued out-of-order al- 
lowing register reservations to preserve the data de- 
pendencies in the code. The implications of dynamic 
superscalar implementation of data-streaming need to 
be explored. 
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Abstract 

To support thread-based real-time computing, we 
propose a new multithreading architecture which includes 
virtual threads and scheduling event upcalls. We have 
implemented this architecture in the Arx real-time 
operating system. Our experimental results show that our 
user-level thread schemes outperform kernel threads found 
in important commercial operating systems without 
compromising the benefits of user-level threads. 

I . INTRODUCTION 

In modern operating systems, threads have emerged as 
an essential mechanism for structuring applications and 
representing concurrency [1]. They are extensively 
substituting traditional UNIX-like processes. More and 
more operating systems have been modified, and even 
created to support threaded programming. This trend is not 
an exception in real-time operating systems. POSIX 
specifies a priority-driven thread model in its real-time 
thread extension standards [6]. The reason behind such 
popularity is clear. Not only can threads offer the desired 
level of abstraction for representing concurrency, but also 
it can support efficient resource management and fast 
context switching. This helps real-time operating systems 
provide short and bounded dispatch latency for real-tune 
threads. Moreover, threads allow programmers to structure 
programs such that multiple asynchronous events in 
programs can be effectively handled. 

Threads can be implemented in various ways and 
incorporated into different places in a programming 
environment . They can be implemented as either a set of 
kernel functionalities or a set of library functions. In 
conventional programming, user-level threads are much 
more favored than kernel-level threads mainly for two 
reasons [1): (1) kernel-level threads provide wrong 
abstraction for a programming environment, since they 
require that every application should be mapped onto a 
single thread implementation that the kernel provides; and 
(2) kernel-level threads incur more overhead than user- 
level threads, since with the kernel-level threads, 
processes must cross an extra user/kernel protection 
boundary for each thread operation. 

On the other hand, in real-time programming, kernel- 
level threads are more often the choice of implementation 
than user-level threads, albeit those disadvantages of the 
kernel-level threads [7]. The kernel-level implementation 
can simplify thread scheduling and signal handling, which 
is a crucial property for real-time processing. With no 
kernel support, the user-level threads cannot deal with 
kernel-level thread blocking and signal propagation. Thus, 
when a user-level thread is blocked in kernel mode via a 
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Figure 1 Overall architecture of the Arx kernel 

multiprocessor issues and do not properly address real- 
time systems issues. For example, in [I], a costly 
scheduler activation is required for each event notification, 
since its user-level stack carries upcall arguments. Thus, if 
a new event occurs while an event is being processed in 
one scheduler activation, the kernel must create another 
and make an upcall interrupting the current scheduler 
activation. Though this scheme works well in 
multiprocessor systems, it incurs too much overhead for 
single processor embedded systems. Thus real-time 
operating systems have continued to support only kernel- 
level threads [7]. 

In this paper, we propose a new user-level 
multithreading scheme that is suitable for embedded real- 
time systems. It consists of dynamic virtual thread binding 
and scheduling event upcall mechanisms. A virtual thread 
is a kernel-level incarnation of a user-level thread that 
provides for a kernel-level execution environment. It is a 
passive entity which is temporarily bound to a user-level 
thread, when necessary. It is never scheduled by the 
kcmel. The scheduling event upcall mechanism enables 
the kcmel to notify user processes of kernel events such as 
thread blocking/unblocking and timer expiration. It thus 
allows real-time applications to preemptively schedule 
their own threads, as the kernel schedules kernel threads. 

II. ARX MULTITHREADING SCHEME 

The thread blocking anomaly of user-level threads 
occur since (1) scheduling events are not propagated to a 
user-level scheduler and (2) the blocking thread is put onto 
a sleep queue keeping the kernel stack designated to its 
process. We present two kernel mechanisms to solve these 
problems, respectively. They are scheduling event upcalls 
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and dynamic virtual thread binding. In designing these 
schemes, we put our greatest emphasis on reducing their 
overhead so that we can realize preemptive user-level 
threads for embedded real-time systems at an affordable 
cost. 

A. Dynamic Virtual Thread Binding 
A virtual thread plays a similar role to a kernel-level 
thread. It provides a user-level thread with a kernel stack 
and the kernel with an identifier to distinguish user-level 
threads running or being blocked in kernel mode. Unlike a 
kernel thread, however, it is a dynamic and passive entity. 
A virtual thread is just a kernel-level incarnation of a 
user- level thread. A user-level thread gets bound to a 
virtual thread when it enters into the kernel making a 
system call. Since a virtual thread is a passive entity, it 
does not multiplex user-level threads or it is not scheduled 
by the kernel. 

It consists of a virtual thread identifier which is unique 
in a process, its state, and a kernel stack on which a user- 
level thread executes a system call. When a user-level 
thread gets blocked in the kernel, the associated virtual 
thread is enqueued onto the sleep channel. For each 
process, the kernel also maintains various fields in the 
process structure such as a current virtual thread, a pool of 
reserved virtual threads, and a list of active virtual threads 
bound to user-level threads. The kernel can identify user- 
level threads only through their bound virtual threads and 
does not affect thread scheduling at user level. 
Consequently, a user-level scheduler takes complete 
control over thread scheduling with the aid of scheduling 
event upcalls. 

B. Scheduling Event Upcalls 
For an application to control its threads at user level, it 
should be able to take appropriate actions on kernel events 
such as thread blocking/unblocking and timer expiration. 
To efficiently propagate such kernel events to an 
application, we propose a scheduling event upcall 
mechanism In this scheme, a multithreaded process has a 
single vector entry to a user-level scheduler which 
processes scheduling events. Upon the occurrence of a 
scheduling event, the kernel upcalls a user-level scheduler 
to notify the event. Unlike normal function calls, upcall 
arguments are not pushed onto the stack of a user-level 
scheduler. Instead, they are passed through the event 
queue located in a kernel/user shared data structure we call 
a scheduler control block (SCB). An SCB is mapped in the 
user's address space. Since. events are queued by the kernel 
regardless of the execution state of the process, they can 
be delivered with a low latency even in a heavily loaded 
environment. 

Event notification is carried out as follows. When an 
event occurs, the kernel first saves the context of the 
current thread. This is done differently according to the 
state of the current thread. 

• When an event occurs while the current thread runs 
in kernel mode, the kernel takes out a virtual thread 
from the pool of reserved virtual threads (or creates a 
new one if it is empty), attaches it. to the process, and 
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Figure 2 Scheduling event upcall example: when a 
thread blocks in a system call waiting for an. I/O event. 

makes it a current virtual thread. It then puts the old 
virtual thread identifier into the "interrupted thread 
context" field in SCB. 

• Otherwise, the kernel copies the user's context onto 
the user stack, adjusts the old stack pointer in the 
register file, and saves it into the "interrupted thread 
context" field of SCB. (We assume that the thread 
package always saves a thread context onto its stack.) 

After saving the context, the kernel makes an event, 
fills it with upcall arguments, and enqueues it into the 
event queue. If a user-level scheduler needs to be invoked, 
then the kernel builds its own return frame with the entry 
point of the user-level scheduler and its stack pointer. 
Finally, control is transferred to user mode through the 
return frame. 

When a user-level scheduler is called by the kernel, it 
repeatedly takes out events from the event queue, until it 
empties the queue. For each event, it performs event- 
specific actions such as priority adjustment in priority 
scheduling. After consuming all events in the queue, it 
determines a thread to run next, and actually dispatches it. 

As an illustrative example, Figure 2 shows the upcall 
procedure when a thread blocks in a system call waiting 
for an I/O event. 

C. Event Types and Upcall Arguments 
Our upcall scheme supports six types of scheduling events 
as follows. 

• BLOCK: When a thread blocks in the kernel waiting 
for an event such as the completion of an I/O 
operation, the kernel immediately upcalls a 
BLOCK, event along with the associated virtual 
thread identifier. For this type of event, the "interrupt 
thread context" field contains no meaningful 
information since no thread is interrupted by this 
upcall. 

• WAKEUP: When a blocked thread wakes up in the 
kernel, a WAKEUP event is sent to a user-level 
scheduler along with the associated virtual thread 
identifier. The kernel also supplies the context 
information of the interrupted thread via SCB. 

• TIMER: When a timer preset by a user-level 
scheduler expires, a TIMER event along with a timer 
delay value is sent to a user-level scheduler. When 
the kernel puts the event into the queue, it calculates 
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the difference between the requested timer value and 
the current time and writes it into the event. 

• SIGNAL: When a signal is delivered to a process, 
the kernel upcalls a SIGNAL event along with 
signal information stored in the siginfo structure 
of the process structure. 

• EXIT: When a process terminates via an exitQor 
exec() system call, the kernel sends this event to a 
user-level scheduler.. 

• RESUME: When a process is resumed after 
preemption, the kernel sends a user-level scheduler a 
RESUME upcall along with the duration of 
preemption. This event is used by the user-level 
scheduler to keep track of thread execution time. 

D. Lock-Free Kernel-User Synchronization 
Since the event queue is shared between a user-level 
scheduler and the kernel, there is a race between them. A 
naive solution to this synchronization problem is to use 
kernel supported locking primitives. However, this scheme 
is too expensive since it requires a user-level scheduler 
should make a system call every invocation. Instead, we 
use a lock-free scheme for synchronization between the 
kernel and a user process. In our scheme, we specially 
design a user-level scheduler in the following manner. 

• A user-level scheduler consists of two subsections. 
In the upper section, it greedily consumes events in 
the event queue, until the queue gets empty. In the 
lower subsection, it selects a thread to run next. 

• The execution of the lower subsection can be safely 
aborted. To make this possible, a user-level scheduler 
has its own private user stack but no thread control 
block. 

In the SCB of a user process, there is the lock field 
which is set by a user-level thread when it executes non- 
reentrant code or wants to disable context switching. 
When an event occurs, the kernel can always enqueue it 
into the event queue regardless of the state of the user 
process. Before the kernel finishes an upcall, it checks if a 
user-scheduler should be invoked. It invokes a user-level 
scheduler, only if the lock value is zero and if the event 
queue was empty at the time of the upcall. The fact that 
the queue was not empty implies the user-level scheduler 
was interrupted by the kernel's upcall handler while 
dequeuing events from the queue. Thus, the kernel does 
'not invoke a user-level scheduler in this case. If the queue 
was empty at the time of the upcall, then either a user- 
level thread was running, or a user-level scheduler was 
executing the lower section. In either case, the kernel 
invokes a new user-level scheduler. If there was a running 
user-level scheduler, the new invocation wipes out the 
context of the previous invocation since the new one uses 
the same stack as the old one. This is an entirely safe 
operation, since the user-level scheduler is designed to get 
aborted safely. 

III. EXPERIMENTAL RESULTS 

We have fully implemented the proposed schemes 
along with several other real-time system supports such as 
user-level interrupts, user-level timers, and real-time 



thread interfaces in the Arx real-time operating system 
that we have developed at Seoul National University. Arx 
is a stand-alone operating system possessing a fully 
preemptive real-time kernel, a POSIX compliant thread 
library, and multithread-safe standard I/O libraries. Arx is 
currently up and running on a Pentium PC and 
StrongARM NC (Net-work Computer). It supports the 
VFAT file system, the TCP/IP protocol suite, and the X 11 
R5 window system. Arx was designed to support flexible 
two-level scheduling, versatile multithreading, and 
efficient user-level I/O [3]. Specifically, scheduling in Arx 
is done in a nested fashion such that the kernel schedules 
processes and a user-level scheduler schedules threads in a 
process, as shown in Figure 1 . The kernel scheduler runs 
the weighted fair queuing algorithm or its variants in that 
processes are scheduled and run according to their shares 
of CPU bandwidth [4]. 

We have conducted extensive experiments and 
measurements to show the performance of the proposed 
schemes implemented in the Arx real-time operating 
system. Our experimental machine was a PC with an Intel 
100 MHz Pentium processor and 256K bytes of secondary 
cache. We made our measurements using a 64-bit cycle 
counter implemented in the Intel Pentium processor . 

u. scheduler context thread 
upcall initiated invoked switch resumed 
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3.11us \ 4 *' u * u > 0.85us (user-mode) 



1.51 us (kernel-mode) 



5.98 (or 0.64) us 
Figure 3 Upcall performance 

We first measured an upcall latency. Since the Arx 
kernel relies heavily on upcalls, it is extremely important 
to minimize the upcall latency. Figure 3 shows the typical 
upcall performance in the Arx kernel with a typical user- 
level scheduler implementing fixed-priority scheduling 
with the round-robin policy. The upcall latency defined as 
time between when a kernel upcall handlers starts and 
when a user-level scheduler starts is 3.11 us. In Figure 3, 
time to resume a thread differs only slightly depending on 
whether it is resumed in kernel mode or in user mode. 
This is because the resume() system call is specially 
designed such that it is handled by a dedicated trap 
handjer which skips common signal check operations. 

To show the performance benefits of our 
multithreading scheme over other operating systems, we 
have performed measurements with a multithreaded Java 
program in Arx, Solaris, and Windows 98. In these 
experiments, we used a 266 Pentium-II PC since we failed 
to install Solaris onto the 100 MHz Pentium PC. We also 
used a custom-made timer board with the timer resolution 
of 50ns since we could not access the Pentium event 
counters in user mode. In. our benchmark Java program 
shown in Appendix of [5], each thread iterates 100,000 
times writing its thread id into a global array protected by 
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Figure 4 Performance of thread scheduling in Arx, Solaris, and Windows 98: (A) aggregate execution times of all 
threads, and (B) corresponding numbers of context switches. 



a synchronized object. As a JVM, we ported and used 
Kaffe version 0.10.0 in Arx, Solaris and Windows 98 with 
the JIT mode being enabled. 

We have measured the aggregate times of thread 
executions varying the number of threads. To observe only 
thread scheduling performance, we deliberately excluded 
time to load Java class files. In these experiments, all 
threads in the Java benchmark program had the same 
priority, and they were scheduled in a round robin fashion. 
Figure 4 summarizes the results of the experiments. Figure 
4 (A) shows the aggregate execution times on three 
different operating systems, and Figure 4 (B) shows the 
numbers of context switches. As expected, Arx always 
incurred less scheduling overhead than Solaris and 
Windows 98 even though a lot more context switches 
happened in the benchmark program running on Arx. 

Contrary to our expectations, the performance 
improvement of Arx over Solaris is marginal. This is 
because Solaris implements mutex lock/unlock primitives 
as user-level library functions even for kernel-level 
threads, and because the round robin quantum size is so 
large in Solaris that thread scheduling overhead 
contributes little to the aggregate execution times. Note 
that the quantum size was 1ms in Arx, while that was 
200ms in Solaris. On the other hand, Arx outperformed 
Windows 98 by an order of magnitude. Unlike Solaris, 
Windows implements mutex primitives as system calls. 

We speculate that Arx will outperform Solaris in a 
greater degree as the quantum size is reduced in Solaris. 
We can support this observation using a simple calculation 
as follows. When the number of threads was 20 in Solaris, 
the aggregate thread execution time was 5109ms with the 
quantum size 200ms. If the round robin quantum becomes 
lms, approximately 5,000 additional context switches will 
occur. Since the scheduling overhead measured in Solaris 
is 200us, 1000ms overhead will be added to the overall 
execution time. The increased execution . time will 
recursively increase the number of context switches. 
Consequently, the overall execution time will be increased 
again. We can repeat this process until the overall 
execution time converges. The result is 6,374ms, which is 
25% larger than the original execution time. 



IV. CONCLUSION 

These results show that user-level threads with our 
scheme yields much better performance than kernel-level 
threads while it can keep the desired properties of kernel- 
level threads. 
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